Types of Judges | Sutro Handbook

Reliability Judges

This is where most teams do, and should, start. These judges ask basic and critical questions:

Did my AI agent or product actually do what it was supposed to do for the user? (pass/fail)
Did it break anywhere in its execution? (pass/fail/unknown)
Does it require escalation to a human? (pass/fail)

These are core reliability questions. If your agent or product consistently fails these checks, you have serious upstream problems to address.
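One way to represent the three checks above is as structured verdicts, so that failure rates can be tracked over time. The following is a minimal sketch, not a prescribed schema: the class names, field names, and the `failure_rate` helper are hypothetical, and it assumes that "pass" on the escalation question means no human escalation was needed.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ReliabilityResult:
    """Judge output for one interaction (hypothetical field names)."""
    task_completed: Verdict        # did it do what it was supposed to do?
    execution_ok: Verdict          # did it break anywhere? may be UNKNOWN
    no_escalation_needed: Verdict  # assumption: PASS means no human was needed


def failure_rate(results, field):
    """Share of FAIL verdicts for `field`, ignoring UNKNOWN.

    Returns None when there are no known verdicts to average over.
    """
    known = [
        getattr(r, field)
        for r in results
        if getattr(r, field) is not Verdict.UNKNOWN
    ]
    if not known:
        return None
    return sum(v is Verdict.FAIL for v in known) / len(known)
```

Excluding `UNKNOWN` from the denominator is a design choice. If unknowns are common, it may be worth reporting their share separately rather than letting them silently disappear.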

These judges are critical to get right. If you cannot trust their verdicts, you risk shipping an entirely broken product experience that users will likely churn from. Or worse, if you're in a compliance-heavy domain, it could mean legal trouble.

Quality and User Sentiment Judges

The next level up the ladder is quality judges. This is how you turn a working AI product into a good product. These ask questions like:

Can you rate the helpfulness of the agent: not at all helpful, somewhat helpful, or very helpful? (not_helpful/somewhat_helpful/very_helpful)
Was the user satisfied with the agent's response? (satisfied/not_satisfied/unclear)
Was the information clear, concise, and detailed? (true/false)
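Because each quality question has a fixed set of answers, it can help to validate the judge's raw output against that set before storing it. The sketch below assumes the judge returns a dictionary (for example, parsed from JSON). The key names and both helper functions are hypothetical; the allowed values are taken from the questions above.

```python
# Allowed answers per question. Key names are illustrative assumptions.
QUALITY_LABELS = {
    "helpfulness": ("not_helpful", "somewhat_helpful", "very_helpful"),
    "satisfaction": ("satisfied", "not_satisfied", "unclear"),
    "clear_concise_detailed": ("true", "false"),
}


def parse_quality_output(raw):
    """Normalize and validate one judge response.

    Raises ValueError if a question is missing or has an unexpected answer.
    """
    parsed = {}
    for name, allowed in QUALITY_LABELS.items():
        # str() also covers JSON booleans: True becomes "true".
        value = str(raw.get(name, "")).strip().lower()
        if value not in allowed:
            raise ValueError(f"{name}: unexpected value {value!r}")
        parsed[name] = value
    return parsed


def helpfulness_score(label):
    """Map the ordered helpfulness labels to 0, 1, 2 for averaging."""
    return QUALITY_LABELS["helpfulness"].index(label)
```

Rejecting unexpected values loudly, rather than coercing them to a default, makes it easier to notice when a judge prompt has drifted from its intended output format.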

Intent Judges

This is where we see AI teams making the leap from good to great. These judges are used to understand not just the quality of an AI product, but what users are actually trying to do with it. This tells teams where they should double down and where they should cut scope.

This will mostly look like a binning exercise, not a single-choice outcome. For this, we recommend a multi-label classifier that is permitted to select multiple options per interaction from a label set of tens to hundreds of observed user actions.
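A multi-label judge returns a list of labels rather than a single choice, so its output needs slightly different handling. The following is a rough sketch under stated assumptions: the judge is assumed to return a list of strings, the three-item label set is only illustrative (a real one would be much larger), and `parse_intents` is a hypothetical helper.

```python
# Illustrative label set; a real one may contain tens to hundreds of labels.
INTENT_LABELS = frozenset({
    "process_a_return",
    "ask_product_support_question",
    "redeem_a_voucher",
})


def parse_intents(raw_labels, allowed=INTENT_LABELS):
    """Split judge output into (known, unknown) label lists.

    Labels are normalized, deduplicated, and kept in first-seen order.
    Unknown labels are returned separately so they can be reviewed
    instead of being silently dropped.
    """
    known, unknown = [], []
    for label in raw_labels:
        label = label.strip().lower()
        if not label:
            continue
        target = known if label in allowed else unknown
        if label not in target:
            target.append(label)
    return known, unknown
```

For example, `parse_intents(["process_a_return", "cancel_subscription"])` would place the first label in the known list and the second in the unknown list.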

You might be wondering how to gather these user action class labels. There are several options, but one of the more appealing is open-set labeling, where the model is permitted to suggest new labels itself. Open-set labeling is worth a short guide of its own, so we will leave it alone for now and assume you have a means of gathering intent labels, most likely by reading over some data and getting a sense of appropriate bins.

When building intent judges, ask questions like:

What did this user try to accomplish through my agent or product? (process_a_return, ask_product_support_question, redeem_a_voucher, etc.)
What did the user want to accomplish that the agent was unable to fully complete? (checkout_from_cart, process_credit_card, remove_saved_item, etc.)
What product drawbacks can we infer from this support interaction? (unclear_instructions, no_id_verification, improperly_handled_case, etc.)
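Once each interaction carries a list of intent labels, a simple aggregate shows which bins dominate. This is a minimal sketch that assumes the judged data is available as one list of labels per interaction; the function name `intent_share` is hypothetical.

```python
from collections import Counter


def intent_share(judged):
    """Fraction of interactions that carry each label, most common first.

    `judged` is an iterable of label lists, one per interaction. Because
    an interaction can carry several labels, the shares may sum to more
    than 1.
    """
    per_interaction = [set(labels) for labels in judged]  # dedupe per interaction
    total = len(per_interaction)
    if total == 0:
        return {}
    counts = Counter(label for labels in per_interaction for label in labels)
    return {label: n / total for label, n in counts.most_common()}
```

Running the same aggregate separately over the "tried to accomplish" and "unable to complete" labels is one way to see where demand is high but completion is low.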