Judges in Evals: Flip Your Intuition

If you are like most developers, your first instinct may be to reject the idea of using non-deterministic approaches in settings where reliability counts. This is especially true in AI reliability itself: using a model to judge the results of another model feels like fighting fire with fire.

This instinct typically stems from a handful of credible doubts. Let us address each of them from first principles.

Doubt	Rebuttal
Intelligence: there is no good reason to believe another LLM should be smarter or more capable than the model it is evaluating.	The model you are using as a judge is not inherently smarter than the model that generated the results it is evaluating. But it does not need to be, because a well-designed judge answers a much narrower question than the task the candidate model was asked to perform.
Subjectivity: you are asking the judge to perform a subjective analysis on something that would otherwise be decided by the expert opinions of your team.	You can ground LLM judges in expert judgment. We will make the bold claim that Sutro offers the best way to do this.
Coverage: there is an infinite range of possible inputs to the candidate model, so it is impossible to test against all possible scenarios.	Continuous input distributions are inherent to building with AI. Even in a world of true AGI, mistakes and edge cases will be abundant. Discrete assertions feel safer, but building AI systems means working in a probabilistic domain.
Non-determinism: results may not be consistent. The exact same input could result in a different judgment, and mild variations of the same input are even more likely to produce this effect.	We can approximate consistency through several inference strategies, and use inconsistency as a tool to understand where we need more coverage.
Measurement: if the judge is another AI model, how can we measure its accuracy?	You can design your judge to be verifiable against a corpus of expert annotations. You can independently measure, calibrate, and optimize the judge's performance against this corpus using general rules, so it can be trusted on data it has never seen.
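
The last two rebuttals can be made concrete with a small sketch. The Python below is illustrative only: the function names (`majority_vote`, `agreement_rate`, `cohens_kappa`) and the string-label format are assumptions made for this example, not part of any particular product's API. It shows one possible consistency strategy (sample the judge several times, take the majority verdict, and treat a low vote share as a sign that more coverage is needed) and one possible way to score a judge against expert annotations (raw agreement plus Cohen's kappa, which corrects for agreement expected by chance).

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(verdicts: Sequence[str]) -> Tuple[str, float]:
    """Return the most common verdict and the share of samples that chose it.

    A low share flags an input on which the judge is inconsistent.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)


def agreement_rate(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of items where the judge label equals the expert label."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


def cohens_kappa(judge: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement between judge and expert, corrected for chance agreement."""
    n = len(judge)
    observed = agreement_rate(judge, expert)
    judge_counts = Counter(judge)
    expert_counts = Counter(expert)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(judge_counts[l] * expert_counts[l] for l in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical data: three judge samples for one input.
    print(majority_vote(["pass", "pass", "fail"]))

    # Hypothetical data: judge verdicts vs. expert annotations on four items.
    judge_labels = ["pass", "fail", "pass", "pass"]
    expert_labels = ["pass", "fail", "fail", "pass"]
    print(agreement_rate(judge_labels, expert_labels))
    print(cohens_kappa(judge_labels, expert_labels))
```

In practice the expert-labelled set would be split so that any tuning of the judge is done on one part and the agreement numbers are reported on a held-out part; otherwise the scores say little about data the judge has never seen.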