Where to Start

How to choose a first eval that is narrow enough to build and useful enough to matter.

Start with the smallest eval that can change a decision. For most teams, that means defining a narrow reliability question and building a repeatable way to measure it.

Examples:

  • Did the agent complete the user's requested task?
  • Did the response contain unsupported claims?
  • Did the system follow the required escalation policy?
  • Did the extraction output include the required fields?
  • Did the workflow fail because of missing context, bad tool use, or faulty model reasoning?
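
As a minimal sketch of what "a repeatable way to measure it" could look like, the example below turns the required-fields question into a deterministic check run over a fixed set of cases. The field names, the `EvalCase` type, and the sample data are all hypothetical, chosen only for illustration; a real eval would load recorded outputs from your own system.

```python
from dataclasses import dataclass

# Hypothetical field names for illustration; substitute your own schema.
REQUIRED_FIELDS = ("invoice_id", "total", "currency")


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    output: dict  # the extraction output under test


def has_required_fields(output: dict) -> bool:
    """Pass only if every required field is present and not None or empty."""
    return all(output.get(name) not in (None, "") for name in REQUIRED_FIELDS)


def run_eval(cases: list[EvalCase]) -> dict:
    """Apply the same check to every case so reruns are comparable."""
    failed = [c.case_id for c in cases if not has_required_fields(c.output)]
    total = len(cases)
    return {
        "total": total,
        "pass_rate": (total - len(failed)) / total if total else 0.0,
        "failed": failed,
    }


# Made-up cases; in practice these would be recorded system outputs.
CASES = [
    EvalCase("a1", {"invoice_id": "INV-1", "total": 120.5, "currency": "USD"}),
    EvalCase("a2", {"invoice_id": "INV-2", "total": 80.0}),  # missing currency
    EvalCase("a3", {"invoice_id": "", "total": 15.0, "currency": "EUR"}),  # empty id
    EvalCase("a4", {"invoice_id": "INV-4", "total": 0, "currency": "USD"}),  # zero is valid
]

report = run_eval(CASES)
print(report)
```

Because the check is deterministic and the case set is fixed, any change in the pass rate between runs reflects a change in the system's outputs, not in the measurement. The list of failing case IDs points to what to inspect first.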

Once the first measurement proves useful, expand coverage by adding more task-specific checks, judge-backed labels, and samples drawn from production traffic.
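
One way to expand coverage without restructuring the eval is to keep checks in a named registry and report a pass rate per check. The sketch below assumes each sample is a dict with hypothetical keys (`task_completed`, `escalated`, `escalation_required`); the checks and the escalation rule are illustrative stand-ins, not a prescribed policy.

```python
from typing import Callable

# A check takes one recorded sample and returns True on pass.
Check = Callable[[dict], bool]


def completed_task(sample: dict) -> bool:
    # Assumes each sample records a hypothetical "task_completed" flag.
    return bool(sample.get("task_completed"))


def followed_escalation_policy(sample: dict) -> bool:
    # Assumed rule for illustration: escalate exactly when it is required.
    return sample.get("escalated", False) == sample.get("escalation_required", False)


# Adding coverage means adding an entry here; the runner stays unchanged.
CHECKS: dict[str, Check] = {
    "completed_task": completed_task,
    "followed_escalation_policy": followed_escalation_policy,
}


def pass_rates(samples: list[dict], checks: dict[str, Check]) -> dict[str, float]:
    """Return the fraction of samples that pass each named check."""
    if not samples:
        return {name: 0.0 for name in checks}
    return {
        name: sum(1 for s in samples if check(s)) / len(samples)
        for name, check in checks.items()
    }


# Made-up samples; in practice these could be drawn from production logs.
SAMPLES = [
    {"task_completed": True, "escalated": False, "escalation_required": False},
    {"task_completed": False, "escalated": False, "escalation_required": True},
    {"task_completed": True, "escalated": True, "escalation_required": True},
    {"task_completed": True, "escalated": True, "escalation_required": False},
]

rates = pass_rates(SAMPLES, CHECKS)
print(rates)
```

Reporting each check separately keeps a regression in one behavior from being hidden by an aggregate score. A judge-backed label could be added as another entry in the registry, provided it is wrapped as a function with the same signature.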