Start with the smallest eval that can change a decision. For most teams, that means defining a narrow reliability question and building a repeatable way to measure it.
Examples:
- Did the agent complete the user's requested task?
- Did the response contain unsupported claims?
- Did the system follow the required escalation policy?
- Did the extraction output include the required fields?
- Did the workflow fail due to missing context, bad tool use, or model reasoning?
Once the first measurement is useful, expand coverage by adding more task-specific checks, judge-backed labels, and production samples.