What's in a Judge? | Sutro Handbook

We recommend building judges with the following components and properties:

Component	Description	Example	Guidance
Model	A strong, instruction-tuned LLM.	GPT-5.4-mini, Gemma-4-31B.	Do not overthink the choice of model. Most modern LLMs are strong instruction-followers, so any foundation model of sufficient size (we recommend at least 30B total parameters as of this writing) should be able to handle a well-defined judge task. Choose something within the latency and cost budget your application requires.
Context	Typically a strong system prompt, with no fine-tuning.	"You are evaluating the outputs of another AI model. Your job is to determine if it helped the customer return their order successfully. Evaluate based on three components..."	We recommend against manual prompt engineering to build judges. Instead, use human annotations and an automated prompt optimization tool to build a strong system prompt for the judge model you have selected.
Input	If used for evals, typically a single user conversation with the model, including inputs and outputs, or an agent trace. If used for other purposes, typically one record of the unstructured or semi-structured data being analyzed.	User: "Can you help me return order ABC12345?" Model: "I would be happy to help. Can you provide confirmation of delivery and the address it was delivered to?"	Give the judge all the information it needs, and do not hide evidence that would be useful in making a decision. You can optionally supplement a judge with web search or other external grounding tools, but these can be hard to audit and may not retrieve the necessary information consistently.
Output Schema	A decision label, ideally binary or ternary, and a rationale.	`{"rationale": "The model asked for all three required components to assist the user with their return.", "label": "pass"}`	Provide an output schema that puts the rationale first and the label second. Frame the task as a single-label classification problem with as few options as possible. Binary or ternary label sets are ideal. Avoid numerical scores when possible; if needed, use a 1-5 Likert scale. Do not ask the model for a confidence score.
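
The output schema guidance above can be illustrated with a small validation step. The following is a minimal sketch, not a prescribed implementation: the names `JudgeVerdict`, `parse_verdict`, and `ALLOWED_LABELS` are hypothetical, and the binary `pass`/`fail` label set is an assumption based on the example in the table. It checks that a raw judge response is a JSON object with the rationale first and the label second, and that the label belongs to a small fixed set.

```python
import json
from dataclasses import dataclass

# Assumed binary label set; a ternary judge would add one more option here.
ALLOWED_LABELS = ("pass", "fail")


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for one judge decision."""
    rationale: str  # comes first in the schema, before the label
    label: str


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse one raw judge response and validate it against the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    # json.loads preserves key order, so this also checks rationale-then-label
    # and rejects extra fields such as a confidence score.
    if list(data.keys()) != ["rationale", "label"]:
        raise ValueError("expected exactly the keys 'rationale' then 'label'")
    rationale, label = data["rationale"], data["label"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}")
    return JudgeVerdict(rationale=rationale, label=label)


raw_output = (
    '{"rationale": "The model asked for all three required components '
    'to assist the user with their return.", "label": "pass"}'
)
verdict = parse_verdict(raw_output)
print(verdict.label)  # prints "pass"
```

Rejecting any response that fails these checks, rather than trying to repair it, keeps the judge's decisions easy to audit. Whether to retry or discard a rejected response is left to the application.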