(Draft)

Testing AI Preferences

3 min read


Testing LLM preferences is 80% methodology, 20% benchmarks, and 100% trust issues.

Unlike basic math where answers are right or wrong, preferences are fuzzy and personal.

The question shifts from “Does it work?” to “How well does it work, and can we trust it?”

You should define what you are shooting for. Do you want the model to be more helpful, more accurate, more engaging, more informative, more consistent, more concise,…?

I like to set this as clear rules: “The response must be simple and concise.”

Even better if you can give it examples of what the output should look like and define what “simple and concise” means in your context.
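
To make that concrete, here is a minimal sketch of how the rules and examples could live next to your test code. The rule wording and the example case are made up for illustration, not a required format:

```python
# A rough sketch: pin "simple and concise" down as explicit rules plus examples.
# Rule text and the example below are illustrative assumptions, not a fixed schema.
PREFERENCE_RULES = """
The response must be simple and concise:
- Use plain words; avoid jargon unless the user used it first.
- Answer in at most three short sentences unless the user asks for more detail.
- No filler like "Great question!" or "As an AI model...".
"""

TEST_CASES = [
    {
        "question": "How do I undo my last git commit?",
        "good": "Run `git reset --soft HEAD~1`. Your changes stay staged.",
        "bad": "Great question! Git is a distributed version control system, "
               "and there are many ways to undo commits depending on ...",
    },
    # Add a good mix of easy and tricky cases here.
]
```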

```mermaid
flowchart TB
    subgraph Setup ["Initial Setup"]
        A[Define Goals] --> B[Set Clear Rules]
        B --> C[Prepare Test Cases]
        note1[/"Example Goals: - Helpfulness - Accuracy - Consistency"/]
        A -.-> note1
    end
    subgraph Testing ["Testing Phase"]
        D[Get Base Model Answers] --> E[Get Trained Model Answers]
        E --> F[Compare Side by Side]
        note2[/"Tip: Save responses for future reference"/]
        E -.-> note2
    end
    subgraph Evaluation ["Evaluation Phase"]
        G[Manual Review] --> H[AI Judge Review]
        H --> I[Score Analysis]
        note3[/"Scoring Scale: 1: Bad 2: Okay 3: Great"/]
        H -.-> note3
    end
    Setup --> Testing
    Testing --> Evaluation
    I --> J{Scores Align?}
    J -->|Yes| K[Process Complete]
    J -->|No| L[Refine Rules/Dataset]
    L --> B
    style Setup fill:#e1f3d8
    style Testing fill:#ffd7d7
    style Evaluation fill:#d7e9ff
    classDef note fill:#fff4dd,stroke-dasharray: 5 5
    class note1,note2,note3 note
```
  1. Pick Test Cases -> Sample questions that might break our preferences -> Get a good mix of easy and tricky cases

  2. Get Answers -> Ask both normal and trained LLMs -> Put their answers side by side

  3. Vibe check -> Look at the outputs side by side and check whether the trained LLM is better -> If not, improve your dataset

  4. Score Them -> Use another LLM as judge -> Have it compare the two answers and give a simple yes or no on whether the trained LLM is better -> Also have it score each answer on a simple scale: 1: Bad (breaks our preferences), 2: Okay (mostly follows preferences), 3: Great (perfectly follows preferences) -> See if the judge is consistent in the ranking and the scoring (sketched in code below)
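
To show what steps 2-4 could look like in code, here is a rough Python sketch that reuses `PREFERENCE_RULES` and `TEST_CASES` from the earlier snippet. The `ask_*` functions are placeholders for whatever model clients you use, and the judge prompt wording and JSON format are my assumptions, not a standard:

```python
import json

def ask_base_llm(prompt: str) -> str:
    """Placeholder: call your base (untrained) model here."""
    raise NotImplementedError

def ask_trained_llm(prompt: str) -> str:
    """Placeholder: call your preference-trained model here."""
    raise NotImplementedError

def ask_judge_llm(prompt: str) -> str:
    """Placeholder: call a strong judge model (e.g. GPT-4) here."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading two answers against these rules:
{rules}

Question: {question}
Answer A (base model): {base}
Answer B (trained model): {trained}

Reply with JSON only:
{{"trained_is_better": true or false, "score_base": 1-3, "score_trained": 1-3}}
Scale: 1 = breaks the rules, 2 = mostly follows them, 3 = follows them perfectly."""

def judge_case(question: str, rules: str) -> dict:
    # Step 2: get both answers for the same question.
    base = ask_base_llm(question)
    trained = ask_trained_llm(question)
    # Step 4: ask the judge for a yes/no ranking plus a 1-3 score per answer.
    verdict = json.loads(ask_judge_llm(JUDGE_PROMPT.format(
        rules=rules, question=question, base=base, trained=trained)))
    # Keep the raw answers next to the verdict for the step-3 vibe check.
    return {"question": question, "base": base, "trained": trained, **verdict}

results = [judge_case(case["question"], PREFERENCE_RULES) for case in TEST_CASES]
```

In practice the judge will not always return clean JSON, so real code would want a retry or a more forgiving parser around that `json.loads` call.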

When the absolute and relative scores disagree, it’s usually a sign that your rules and scale are not clear enough, or that the model did not improve as much as you expected.
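
Continuing the sketch above, a quick way to surface that mismatch is to count how often the yes/no verdict agrees with the score difference (the ~80% figure in the comment is just a rule of thumb, not a hard threshold):

```python
def score_agreement(results: list[dict]) -> float:
    """Fraction of cases where the yes/no verdict matches the 1-3 scores."""
    if not results:
        return 0.0
    consistent = sum(
        1 for r in results
        if (r["score_trained"] > r["score_base"]) == r["trained_is_better"]
    )
    return consistent / len(results)

# If agreement is low (say under ~80%), tighten the rules or the scale
# before trusting either number.
```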

Tips:

  • Start with simple tests first
  • Use strong models (like GPT-4) as judges
  • Mix LLM-based testing with human checks
  • Save good and bad examples to learn from
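
For the last tip, appending each judged case to a JSONL file is usually enough; the filename here is arbitrary:

```python
import json

def save_examples(results: list[dict], path: str = "preference_evals.jsonl") -> None:
    # Append every judged case so good and bad examples accumulate over time.
    with open(path, "a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```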

No test is perfect. Mix different methods to get a good picture.