(Draft)

Testing AI Preferences

3 min read


Testing LLM preferences is 80% methodology, 20% benchmarks, and 100% trust issues.

Unlike basic math where answers are right or wrong, preferences are fuzzy and personal.

The question shifts from “Does it work?” to “How well does it work, and can we trust it?”

You should define what you are shooting for. Do you want the model to be more helpful, more accurate, more engaging, more informative, more consistent, more concise,…?

I like to set this as clear rules: “The response must be simple and concise.”

Even better if you can give it examples of what the output should look like and define what “simple and concise” means in your context.
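
To make that concrete, here is a minimal sketch of how the rules and examples could live next to your test code. The rule wording and the example case are made up for illustration, not a required format:

```python
# A rough sketch: pin "simple and concise" down as explicit rules plus examples.
# Rule text and the example below are illustrative assumptions, not a fixed schema.
PREFERENCE_RULES = """
The response must be simple and concise:
- Use plain words; avoid jargon unless the user used it first.
- Answer in at most three short sentences unless the user asks for more detail.
- No filler like "Great question!" or "As an AI model...".
"""

TEST_CASES = [
    {
        "question": "How do I undo my last git commit?",
        "good": "Run `git reset --soft HEAD~1`. Your changes stay staged.",
        "bad": "Great question! Git is a distributed version control system, "
               "and there are many ways to undo commits depending on ...",
    },
    # Add a good mix of easy and tricky cases here.
]
```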

```mermaid
flowchart TB
    subgraph Setup ["Initial Setup"]
        A[Define Goals] --> B[Set Clear Rules]
        B --> C[Prepare Test Cases]
        note1[/"Example Goals: - Helpfulness - Accuracy - Consistency"/]
        A -.-> note1
    end
    subgraph Testing ["Testing Phase"]
        D[Get Base Model Answers] --> E[Get Trained Model Answers]
        E --> F[Compare Side by Side]
        note2[/"Tip: Save responses for future reference"/]
        E -.-> note2
    end
    subgraph Evaluation ["Evaluation Phase"]
        G[Manual Review] --> H[AI Judge Review]
        H --> I[Score Analysis]
        note3[/"Scoring Scale: 1: Bad 2: Okay 3: Great"/]
        H -.-> note3
    end
    Setup --> Testing
    Testing --> Evaluation
    I --> J{Scores Align?}
    J -->|Yes| K[Process Complete]
    J -->|No| L[Refine Rules/Dataset]
    L --> B
    style Setup fill:#e1f3d8
    style Testing fill:#ffd7d7
    style Evaluation fill:#d7e9ff
    classDef note fill:#fff4dd,stroke-dasharray: 5 5
    class note1,note2,note3 note
```
  1. Pick Test Cases -> Sample questions that might break our preferences -> Get a good mix of easy and tricky cases

  2. Get Answers -> Ask both normal and trained LLMs -> Put their answers side by side

  3. Vibe check -> Look at the outputs side by side and check whether the trained LLM is better -> If not, improve your dataset

  4. Score Them -> Use another LLM as judge -> Have it compare the two answers and give a simple yes or no on whether the trained LLM is better -> Also have it score each answer on a simple scale: 1: Bad (breaks our preferences), 2: Okay (mostly follows preferences), 3: Great (perfectly follows preferences) -> See if the judge is consistent in the ranking and the scoring (sketched in code below)
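
To show what steps 2-4 could look like in code, here is a rough Python sketch that reuses `PREFERENCE_RULES` and `TEST_CASES` from the earlier snippet. The `ask_*` functions are placeholders for whatever model clients you use, and the judge prompt wording and JSON format are my assumptions, not a standard:

```python
import json

def ask_base_llm(prompt: str) -> str:
    """Placeholder: call your base (untrained) model here."""
    raise NotImplementedError

def ask_trained_llm(prompt: str) -> str:
    """Placeholder: call your preference-trained model here."""
    raise NotImplementedError

def ask_judge_llm(prompt: str) -> str:
    """Placeholder: call a strong judge model (e.g. GPT-4) here."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading two answers against these rules:
{rules}

Question: {question}
Answer A (base model): {base}
Answer B (trained model): {trained}

Reply with JSON only:
{{"trained_is_better": true or false, "score_base": 1-3, "score_trained": 1-3}}
Scale: 1 = breaks the rules, 2 = mostly follows them, 3 = follows them perfectly."""

def judge_case(question: str, rules: str) -> dict:
    # Step 2: get both answers for the same question.
    base = ask_base_llm(question)
    trained = ask_trained_llm(question)
    # Step 4: ask the judge for a yes/no ranking plus a 1-3 score per answer.
    verdict = json.loads(ask_judge_llm(JUDGE_PROMPT.format(
        rules=rules, question=question, base=base, trained=trained)))
    # Keep the raw answers next to the verdict for the step-3 vibe check.
    return {"question": question, "base": base, "trained": trained, **verdict}

results = [judge_case(case["question"], PREFERENCE_RULES) for case in TEST_CASES]
```

In practice the judge will not always return clean JSON, so real code would want a retry or a more forgiving parser around that `json.loads` call.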

When the absolute and relative scores disagree, it’s usually a sign that your rules and scale are not clear enough, or that the model did not improve as much as you expected.
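
Continuing the sketch above, a quick way to surface that mismatch is to count how often the yes/no verdict agrees with the score difference (the ~80% figure in the comment is just a rule of thumb, not a hard threshold):

```python
def score_agreement(results: list[dict]) -> float:
    """Fraction of cases where the yes/no verdict matches the 1-3 scores."""
    if not results:
        return 0.0
    consistent = sum(
        1 for r in results
        if (r["score_trained"] > r["score_base"]) == r["trained_is_better"]
    )
    return consistent / len(results)

# If agreement is low (say under ~80%), tighten the rules or the scale
# before trusting either number.
```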

Tips:

  • Start with simple tests first
  • Use strong models (like GPT-4) as judges
  • Mix LLM-based testing with human checks
  • Save good and bad examples to learn from
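
For the last tip, appending each judged case to a JSONL file is usually enough; the filename here is arbitrary:

```python
import json

def save_examples(results: list[dict], path: str = "preference_evals.jsonl") -> None:
    # Append every judged case so good and bad examples accumulate over time.
    with open(path, "a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```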

No test is perfect. Mix different methods to get a good picture.