Static Evals vs. Judges | Sutro Handbook

Consider a simple fact: the input to an LLM is unstructured data, such as text, images, audio, and more, in an infinite, unbounded range. Even if you could force determinism, such that the exact same set of input characters always yielded the same result from a model, you would still have an infinite set of input cases to test against.

A static set of test cases can only cover a small, discrete sample of this range, which is why most static benchmarks are easy to overfit and are widely regarded as unreliable. With an LLM judge, you can cover a wide, continuous range of possible inputs. But whichever method you choose, you are still sampling from an infinite set of test cases, so it is time to throw out the idea that you will ever have perfect test coverage when building with AI.

Much of the rest of the judge-design guide is about effectively using an AI judge as a bridge from an unbounded output space to a bounded range that can be validated.
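To make the contrast concrete, here is a minimal Python sketch, not a definitive implementation. Every name in it (`STATIC_CASES`, `static_eval`, `judge_eval`, `stub_model`, `stub_judge`) is hypothetical and invented for illustration. The model and judge are stand-in functions rather than real LLM calls, and the pass/fail label set is an assumption. The static eval only scores the exact inputs it was written for, while the judge scores any prompt and collapses its free-form reply into a small, bounded set of verdicts.

```python
from typing import Callable

# Hypothetical static eval set: fixed inputs paired with exact expected outputs.
STATIC_CASES = {
    "What is 2 + 2?": "4",
    "Capital of France?": "Paris",
}

# Assumed bounded label set that the judge's free-form reply is mapped onto.
ALLOWED_VERDICTS = ("pass", "fail")


def static_eval(model: Callable[[str], str], cases: dict[str, str]) -> float:
    """Fraction of fixed cases where the model output matches exactly."""
    hits = sum(model(question).strip() == answer for question, answer in cases.items())
    return hits / len(cases)


def judge_eval(
    model: Callable[[str], str],
    judge: Callable[[str], str],
    prompts: list[str],
    rubric: str,
) -> float:
    """Fraction of prompts whose output the judge labels 'pass'."""
    verdicts = []
    for prompt in prompts:
        output = model(prompt)
        raw = judge(
            f"Rubric: {rubric}\nInput: {prompt}\nOutput: {output}\n"
            "Answer with exactly one word: pass or fail."
        )
        verdict = raw.strip().lower()
        if verdict not in ALLOWED_VERDICTS:
            # Design choice for this sketch: an unparseable reply counts as a failure.
            verdict = "fail"
        verdicts.append(verdict)
    return verdicts.count("pass") / len(verdicts)


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM under test (hypothetical canned answers)."""
    answers = {
        "What is 2 + 2?": "4",
        "Capital of France?": "The capital is Paris.",
    }
    return answers.get(prompt, "I am not sure.")


def stub_judge(judge_prompt: str) -> str:
    """Stand-in for an LLM judge: fails any output that hedges with 'not sure'."""
    return "Fail" if "not sure" in judge_prompt else " PASS\n"


if __name__ == "__main__":
    prompts = [
        "What is 2 + 2?",
        "Capital of France?",
        "Name the capital city of France, please.",  # not in the static set
    ]
    rubric = "The output answers the question correctly."
    print("static:", static_eval(stub_model, STATIC_CASES))
    print("judge: ", judge_eval(stub_model, stub_judge, prompts, rubric))
```

In this toy run the static eval marks "The capital is Paris." as wrong because it is not an exact match, and it has nothing to say about the third prompt. The judge can score both, but its own reply is unstructured text, so the sketch validates it against `ALLOWED_VERDICTS` before counting it.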