Testing LLM preferences is 80% methodology, 20% benchmarks, and 100% trust issues.
Unlike basic math where answers are right or wrong, preferences are fuzzy and personal.
The question shifts from “Does it work?” to “How well does it work, and can we trust it?”
You should define what you are shooting for. Do you want the model to be more helpful, more accurate, more engaging, more informative, more consistent, more concise,…?
I like to write these down as clear rules: “The response must be simple and concise.”
Even better if you can give it examples of what a good response looks like and define what “simple and concise” means in your context.
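Here is a minimal sketch of what such a rule could look like once it is written down. The `PreferenceRule` class, its fields, and the example values are all hypothetical; the point is only that the rule, its definition, and the examples live in one place that both a human and a judge model can read.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceRule:
    """One preference, its local definition, and examples of both sides."""
    rule: str
    definition: str
    good_examples: list[str] = field(default_factory=list)
    bad_examples: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the rule as plain text that a judge (or a human) can read."""
        lines = [f"Rule: {self.rule}", f"Definition: {self.definition}"]
        lines += [f"Good example: {ex}" for ex in self.good_examples]
        lines += [f"Bad example: {ex}" for ex in self.bad_examples]
        return "\n".join(lines)


# Hypothetical values; replace them with your own definition and examples.
concise = PreferenceRule(
    rule="The response must be simple and concise.",
    definition="At most three sentences, no jargon, no preamble.",
    good_examples=["Paris is the capital of France."],
    bad_examples=["Great question! Before answering, let me give some background..."],
)

print(concise.to_prompt())
```

The same text can be pasted into a judge prompt or handed to a human reviewer, so everyone scores against the same definition.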
- Pick Test Cases -> Sample questions that might break our preferences -> Get a good mix of easy and tricky cases
- Get Answers -> Ask both normal and trained LLMs -> Put their answers side by side
- Vibe check -> Look at the outputs side by side and check whether the trained LLM is better -> If not, improve your dataset
- Score Them -> Use another LLM as judge -> Have it compare the two answers and give a simple yes or no whether the trained LLM is better -> Also have it score each answer on a simple scale: 1: Bad (breaks our preferences), 2: Okay (mostly follows preferences), 3: Great (perfectly follows preferences) -> See if the judge is consistent in the ranking and the scoring
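The steps above can be sketched as one small loop. This is only a sketch under assumptions: `evaluate`, the record keys, and the toy models and judges are hypothetical stand-ins so the code runs offline. In practice the four callables would wrap your normal model, your trained model, and a judge LLM.

```python
from typing import Callable

Model = Callable[[str], str]                 # question -> answer
PairJudge = Callable[[str, str, str], bool]  # (question, baseline, trained) -> is trained better?
ScoreJudge = Callable[[str, str], int]       # (question, answer) -> 1, 2 or 3


def evaluate(questions: list[str], baseline: Model, trained: Model,
             judge_pair: PairJudge, judge_score: ScoreJudge) -> list[dict]:
    """Collect side-by-side answers plus the relative and absolute judgments."""
    records = []
    for q in questions:
        base_answer, trained_answer = baseline(q), trained(q)
        records.append({
            "question": q,
            "baseline": base_answer,
            "trained": trained_answer,
            "trained_wins": judge_pair(q, base_answer, trained_answer),
            "baseline_score": judge_score(q, base_answer),
            "trained_score": judge_score(q, trained_answer),
        })
    return records


# Toy stand-ins (assumptions): they only look at length, to mimic a
# "simple and concise" preference without calling any real model.
def baseline_model(q: str) -> str:
    return f"Well, that is an interesting question. The answer to '{q}' is 4, as far as I can tell."


def trained_model(q: str) -> str:
    return "4."


def prefer_shorter(q: str, base_answer: str, trained_answer: str) -> bool:
    return len(trained_answer) < len(base_answer)


def score_by_length(q: str, answer: str) -> int:
    if len(answer) < 20:
        return 3
    return 2 if len(answer) < 60 else 1


records = evaluate(["What is 2 + 2?"], baseline_model, trained_model,
                   prefer_shorter, score_by_length)
win_rate = sum(r["trained_wins"] for r in records) / len(records)
print(f"trained model win rate: {win_rate:.0%}")
```

Keeping the raw answers in each record means the vibe check and the judge scores come from the same data, so you can always go back and read what the judge actually saw.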
When the absolute and relative scores disagree, it’s usually a sign that your rule and scale are not clear enough, or that the model did not improve as much as you expected.
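A quick way to spot those cases is to flag every test case where the yes/no verdict and the 1-3 scores point in different directions. This is a rough sketch: `find_disagreements`, the record keys, and the sample judgments are hypothetical, and a tie in scores combined with a "yes" verdict is treated as a disagreement, which may be too strict on such a coarse scale.

```python
def find_disagreements(records: list[dict]) -> list[dict]:
    """Return records where the yes/no verdict contradicts the 1-3 scores."""
    flagged = []
    for r in records:
        scores_say_better = r["trained_score"] > r["baseline_score"]
        if r["trained_wins"] != scores_say_better:
            flagged.append(r)
    return flagged


# Hypothetical judge outputs for three test cases.
judgments = [
    {"id": "q1", "trained_wins": True, "baseline_score": 2, "trained_score": 3},
    {"id": "q2", "trained_wins": True, "baseline_score": 3, "trained_score": 2},
    {"id": "q3", "trained_wins": False, "baseline_score": 2, "trained_score": 2},
]

flagged = find_disagreements(judgments)
print([r["id"] for r in flagged])  # -> ['q2']
```

The flagged cases are the ones worth reading by hand first, since they are where the rule or the scale is most likely ambiguous.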
Tips:
- Start with simple tests first
- Use strong models (like GPT-4) as judges
- Mix LLM-based testing with human checks
- Save good and bad examples to learn from
No test is perfect. Mix different methods to get a good picture.