Eval Approaches | Sutro Handbook

Most eval systems combine multiple approaches:

Static Evals

Static eval sets: curated examples that are run repeatedly to catch regressions and compare candidate changes.
LLM judges: model-based evaluators that map unstructured outputs into bounded labels, rationales, or classifications.
Human annotation: expert review used to ground judge behavior, audit model performance, and build trust in measurements.
Production sampling: real traces or records sampled from live usage to discover new failure modes and measure field behavior.
Operational metrics: latency, cost, refusal rate, escalation rate, completion rate, and other system-level signals.
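
As a rough illustration of how a static eval set catches regressions while also recording an operational metric, here is a minimal sketch. Everything in it is hypothetical and chosen for illustration: the EvalCase fields, the exact-match scoring rule, the two stand-in systems, and the helper names run_static_evals and find_regressions. A real system would likely replace exact match with a judge or a task-specific scorer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One curated example in a static eval set (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


def run_static_evals(
    system: Callable[[str], str], cases: List[EvalCase]
) -> Tuple[Dict[str, bool], float]:
    """Run every case through `system`.

    Returns per-case pass/fail (case-insensitive exact match, an assumed
    scoring rule) and the mean latency in seconds as an operational metric.
    """
    passed: Dict[str, bool] = {}
    total_seconds = 0.0
    for case in cases:
        start = time.perf_counter()
        output = system(case.prompt)
        total_seconds += time.perf_counter() - start
        passed[case.case_id] = (
            output.strip().lower() == case.expected.strip().lower()
        )
    mean_latency = total_seconds / len(cases) if cases else 0.0
    return passed, mean_latency


def find_regressions(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> List[str]:
    """Return ids of cases the baseline passed but the candidate failed."""
    return sorted(
        case_id
        for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )


CASES = [
    EvalCase("arith-1", "2+2", "4"),
    EvalCase("geo-1", "capital of France", "Paris"),
    EvalCase("geo-2", "capital of Japan", "Tokyo"),
]


# Stand-in systems: lookup tables instead of real model calls.
def baseline_system(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def candidate_system(prompt: str) -> str:
    return {"2+2": "4", "capital of Japan": "Tokyo"}.get(prompt, "")


baseline_results, baseline_latency = run_static_evals(baseline_system, CASES)
candidate_results, candidate_latency = run_static_evals(candidate_system, CASES)
regressed = find_regressions(baseline_results, candidate_results)
print("regressions:", regressed)
```

Because the same cases run against both systems, the comparison is repeatable: the candidate fixes geo-2 but breaks geo-1, and only the per-case diff makes that visible, since the aggregate pass rate is unchanged.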

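No single eval method is sufficient on its own. Static evals provide repeatability, LLM judges extend measurement to open-ended outputs at scale, and human review provides grounding. Production sampling and operational metrics complement these by showing how the system behaves in live use, including failure modes a curated set has not yet captured.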
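
One way to make "grounding" concrete is to measure how often an LLM judge agrees with human annotators on the same items. The sketch below uses Cohen's kappa, which is one common agreement statistic and an assumption here, not something this handbook prescribes. The labels and the sample data are invented for illustration.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two label sequences.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the agreement expected if both raters labeled independently with
    their own observed label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if p_expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented sample: six items labeled by a human expert and by a judge.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"judge-human kappa: {kappa:.3f}")
```

Here the judge matches the human on five of six items, and the kappa comes out to about 0.667 once chance agreement is discounted. What threshold counts as good enough for trusting the judge at scale is a judgment call for the team running the eval.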