We recommend building judges composed of the following components and properties:

| Component | Description | Example | Guidance |
|---|---|---|---|
| Model | A strong, instruction-tuned LLM. | GPT-5.4-mini, Gemma-4-31B. | Do not overthink the choice of model. Most modern LLMs are strong instruction-followers, so any foundation model of sufficient size (we recommend at least 30B total parameters as of this writing) should be able to handle a well-defined judge task. Choose something within the latency and cost budget your application requires. |
| Context | Typically a strong system prompt, with no fine-tuning. | "You are evaluating the outputs of another AI model. Your job is to determine if it helped the customer return their order successfully. Evaluate based on three components..." | We recommend against manual prompt engineering to build judges. Instead, collect human annotations and use an automated prompt optimization tool to build a strong system prompt for the judge model you have selected. |
| Input | If used for evals, typically a single user conversation with the model, including inputs and outputs, or an agent trace. If used for other purposes, typically one record of the unstructured or semi-structured data being analyzed. | User: "Can you help me return order ABC12345?" Model: "I would be happy to help. Can you provide confirmation of delivery and the address it was delivered to?" | Provide all the information the judge needs, and do not hide evidence that would be useful in making a decision. You can optionally supplement a judge with web search or other external grounding tools, but these can be hard to audit and vary widely in whether they retrieve the necessary information. |
| Output Schema | A decision label, ideally binary or ternary, and a rationale. | `{"rationale": "The model asked for all three required components to assist the user with their return.", "label": "pass"}` | Provide an output schema with rationale first, then label second. Frame the task as a single-label classification problem with as few options as possible. Binary or ternary label sets are ideal. Avoid numerical scores when possible; if needed, use a 1-5 Likert scale. Do not ask the model for a confidence score. |
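
The output schema guidance in the table can be enforced with a small validation step on the judge's raw response. The following is a minimal sketch, not a reference implementation: the names `ALLOWED_LABELS`, `OUTPUT_INSTRUCTIONS`, `Verdict`, and `parse_verdict` are hypothetical, the binary `pass`/`fail` label set is an assumption based on the example row, and the code assumes the judge returns a bare JSON object with no surrounding text.

```python
import json
from dataclasses import dataclass

# Hypothetical label set; the guidance above recommends binary or ternary labels.
ALLOWED_LABELS = ("pass", "fail")

# Hypothetical instruction text that could be appended to the judge's system prompt.
OUTPUT_INSTRUCTIONS = (
    "Respond with a single JSON object with exactly two keys, in this order: "
    '"rationale" (a short explanation), then "label" (one of: '
    + ", ".join(ALLOWED_LABELS)
    + ")."
)


@dataclass(frozen=True)
class Verdict:
    rationale: str
    label: str


def parse_verdict(raw: str) -> Verdict:
    """Parse and validate one judge response; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    # json.loads keeps keys in document order, so this also checks that the
    # rationale was written before the label.
    keys = list(data)
    if keys != ["rationale", "label"]:
        raise ValueError(f"expected keys ['rationale', 'label'] in order, got {keys}")

    rationale, label = data["rationale"], data["label"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {label!r}")
    return Verdict(rationale=rationale, label=label)


if __name__ == "__main__":
    raw = (
        '{"rationale": "The model asked for all three required components '
        'to assist the user with their return.", "label": "pass"}'
    )
    print(parse_verdict(raw))
```

Rejecting malformed responses outright, rather than trying to repair them, keeps the label set closed: a numerical score, an extra confidence field, or a label written before its rationale raises a `ValueError` that the caller can log or retry.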