Evals

Patterns for measuring AI system behavior, reliability, and quality before and after release.

Evals are the measurement layer for AI systems. They help teams understand whether a candidate model, agent, prompt, workflow, or retrieval system behaves well enough for the job it is meant to do.

Unlike conventional software tests, evals often need to measure behavior over unstructured inputs and probabilistic outputs. That creates an important distinction: do not seek or expect 100% test coverage with evals. If you do, you are likely overfitting to a narrow set of cases.

Instead, aim for performance at or above human-level capability, as measured by expert-grounded judges.
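
As a rough illustration of gating on a human baseline rather than on a perfect score, the sketch below computes a pass rate over a small eval set and compares it to a human pass rate. Everything in it is hypothetical: the `EvalCase` fields, the `toy_system` stand-in for the system under test, the `exact_match_judge` (a deliberately crude substitute for an expert-grounded judge), and the `HUMAN_PASS_RATE` value, which is an assumed number and not a measured one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    reference: str  # expert-written expected answer (hypothetical field)


def pass_rate(
    cases: Sequence[EvalCase],
    system: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of cases where the judge accepts the system's output."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(judge(system(case.prompt), case.reference) for case in cases)
    return passed / len(cases)


def meets_bar(candidate_rate: float, human_rate: float) -> bool:
    """Gate on matching or beating the human baseline, not on reaching 100%."""
    return candidate_rate >= human_rate


# --- Toy usage: all values below are illustrative assumptions ---

def toy_system(prompt: str) -> str:
    """Stand-in for the model, agent, or workflow being evaluated."""
    canned = {
        "2+2": "4",
        "capital of France": "Paris",
        "3*3": "9",
        "opposite of hot": "warm",  # deliberately wrong answer
    }
    return canned.get(prompt, "")


def exact_match_judge(output: str, reference: str) -> bool:
    """Crude judge: a real one would encode expert grading criteria."""
    return output.strip().lower() == reference.strip().lower()


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]

HUMAN_PASS_RATE = 0.75  # assumed baseline, not a measured figure

candidate_rate = pass_rate(cases, toy_system, exact_match_judge)
print(f"candidate={candidate_rate:.2f} passes={meets_bar(candidate_rate, HUMAN_PASS_RATE)}")
```

In this toy run the candidate fails one of four cases and still clears the gate, because the bar is the assumed human pass rate and not a perfect score. In practice the judge and the baseline would both need to be grounded in expert review of the same cases.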

Pages in This Section
