Most eval systems combine multiple approaches:
- Static eval sets: curated examples that are run repeatedly to catch regressions and compare candidate changes.
- LLM judges: model-based evaluators that map unstructured outputs into bounded labels, rationales, or classifications.
- Human annotation: expert review used to ground judge behavior, audit model performance, and build trust in measurements.
- Production sampling: real traces or records sampled from live usage to discover new failure modes and measure field behavior.
- Operational metrics: latency, cost, refusal rate, escalation rate, completion rate, and other system-level signals.
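No single eval method is sufficient on its own. Static eval sets provide repeatability, LLM judges scale to open-ended outputs that exact-match checks cannot score, and human review provides grounding. Production sampling and operational metrics complement these by showing how the system behaves in the field.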
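As a rough illustration of how these pieces can fit together, the sketch below runs a small static eval set through a judge that returns a bounded label, then checks the judge against human labels. Every name here (`EvalCase`, `keyword_judge`, `run_static_eval`, `judge_human_agreement`) is hypothetical, and the keyword-matching judge is only a stand-in for a real LLM judge call; treat it as one possible shape under those assumptions, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Bounded label space the judge is allowed to return (an assumption for this sketch).
LABELS = ("pass", "fail")


@dataclass(frozen=True)
class EvalCase:
    """One curated example in a static eval set (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


def keyword_judge(output: str, expected: str) -> str:
    """Stand-in for an LLM judge: maps free-form output to a bounded label.

    A real judge would call a model; this one only checks for a substring.
    """
    return "pass" if expected.lower() in output.lower() else "fail"


def run_static_eval(
    cases: Iterable[EvalCase],
    system: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run every case through the system under test and label each output."""
    labels: Dict[str, str] = {}
    for case in cases:
        label = judge(system(case.prompt), case.expected)
        if label not in LABELS:
            # Reject anything outside the bounded label space.
            raise ValueError(f"judge returned unknown label: {label!r}")
        labels[case.case_id] = label
    return labels


def pass_rate(labels: Dict[str, str]) -> float:
    """Fraction of cases labeled 'pass'."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(label == "pass" for label in labels.values()) / len(labels)


def judge_human_agreement(
    judge_labels: Dict[str, str], human_labels: Dict[str, str]
) -> float:
    """Fraction of shared cases where the judge and the human label agree."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    return sum(judge_labels[k] == human_labels[k] for k in shared) / len(shared)


if __name__ == "__main__":
    cases = [
        EvalCase("c1", "What is the capital of France?", "Paris"),
        EvalCase("c2", "What is 2 + 2?", "4"),
    ]
    canned = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "5",
    }
    labels = run_static_eval(cases, lambda p: canned[p], keyword_judge)
    print(labels, pass_rate(labels))
    print(judge_human_agreement(labels, {"c1": "pass", "c2": "fail"}))
```

Rerunning the same `cases` against each candidate change gives the repeatable regression signal, while a low `judge_human_agreement` value suggests the judge itself needs recalibration before its labels are trusted.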