# The Analytical AI Handbook

A living FAQ to build, measure, optimize, and scale reliable decision models

Source: https://sutro.sh/handbook/llms.txt

---

# The Analytical AI Handbook

A living FAQ to build, measure, optimize, and scale reliable decision models

Source: https://sutro.sh/handbook

Updated: 2026-06-02

## What is Analytical AI?

In late 2022, the "ChatGPT moment" happened. The masses began to understand the general power of foundation models, and developers immediately started incorporating them into a slew of new products.

At the same time, a less-discussed usage pattern emerged: data, research, ops, and product teams began using foundation models to process unstructured data and make operational decisions at scale.

Put simply: if the AI's job is to **decide something rather than create something**, it's analytical AI.

## Why does analytical AI matter?

While the distinction may seem subtle, best practices for analytical purposes often diverge from those for other generative use cases, for a few primary reasons:

1) Tasks are typically *measurable*. You can create a ground-truth dataset using [expert annotations](/patterns/context/expert-annotation) and validate model outputs against it for correctness. Other generative AI outputs are not directly measurable, which is why you need to build [evals](/patterns/evals) (a special case of analytical AI) to measure them.

2) Tasks are often specific and discriminative, not general and emergent. You use an LLM's autoregressive reasoning and instruction-following capabilities to make decisions, but reduce "creativity" in favor of [consistency](/patterns/consistency). For this reason, the task can often be run on the smallest possible model that's been evaluated for task accuracy, rather than reaching for the largest, maximally-intelligent model.

3) Because analytical AI typically does not involve a transaction with a user, more latency is tolerated - so [batch](/deployment/batch-vs-real-time-inference) and other flexible workload processing models are acceptable, often saving tremendously on costs and overall processing time. This is analogous to OLTP vs. OLAP/map-reduce style data processing.

| Property | Other GenAI | Analytical AI |
| --- | --- | --- |
| Examples | Write text/code, generate images/videos, converse with users  | [Classify](/primitives/classifiers), [extract](/primitives/extractors), [judge](/primitives/judges), normalize, match, score |
| Operational Paradigm | Many different user tasks | One task, many times |
| Interaction Pattern | User-facing, transactional | Typically internal data processing & workflows |
| Model Needs | Maximum intelligence and size subject to cost constraints | Minimum intelligence and size required for accurate task completion |
| Serving & Latency | Low-latency, real-time/online | High-throughput, batch/offline |
| Determinism Expectations | Diverse responses, emergent behavior | Consistency, close-to-deterministic behavior |
| User Personas | Consumers, Misc. Professionals | Data Scientists/Engineers, Ops, Evals, Product Analytics |
| Task Supervision | Supervised, interactive | Unsupervised |
| Analog | OLTP Databases, Web Applications | OLAP Databases, Data Pipelines |

## Who is this guide for?

- Data, ML, and analytics teams using LLMs to transform unstructured datasets to structured ones
- AI engineers and product managers building [evals](/patterns/evals) and trying to improve the reliability of their AI products
- Operations teams looking to scale the expertise of their domain experts via reliable AI decision models
- Research teams building [judges](/primitives/judges) and other verifiable reward functions

It's also written for us - data, infra, and dev tools nerds who are passionate about expanding the scope of what's possible with data and increasing the leverage of developers.

## Why did we write this guide?

Sutro builds products to support analytical AI, which we see as an early but emerging space. Many of our customers are just getting started building these systems, especially now that more AI products are coming online and generating unstructured data that needs analytical processing. We spend a lot of time in the trenches with customers, helping them architect, design, improve, and reason through how to build these systems. This guide can be thought of as an evolving FAQ as we learn alongside our customers.

The goal of this guide is to serve as living reference material for developers who are building analytical AI products, regardless of their choice of tooling (although we hope you'll come talk to us).

## How to Use This Handbook

**[Primitives](/primitives)** covers the core analytical AI workload types.
<br>
**[Patterns](/patterns)** discusses best practices for implementation of the primitives.
<br>
**[Architectures](/architectures)** are higher-level guides to build end-to-end systems.
<br>
**[Deployment](/deployment)** covers operational considerations for production use.

Each page should be useful on its own, and we recommend starting with the pages most applicable to your current needs. If you are reading primarily out of curiosity, we recommend starting in the Primitives section.

---

# Primitives

Learn the fundamental building blocks of analytical AI systems.

Source: https://sutro.sh/handbook/primitives

Updated: 2026-05-18

This section covers the basic, atomic units of computation that can be composed into useful analytical AI systems.

While some of the content in this section may seem familiar, we try to offer unique and opinionated best practices based on our experience working with real-world, applied AI teams across a variety of domains.

---

# Classifiers

The most flexible and broadly applicable analytical AI primitive

Source: https://sutro.sh/handbook/primitives/classifiers

Updated: 2026-06-02

Classifiers are perhaps the oldest form of AI in existence. The original perceptron demonstration in 1957 (the first neural network) was itself a binary classifier.

Nearly 70 years later, classification remains one of the most important decision models in existence, and underpins the operations of nearly all AI systems in some capacity.

## Why humans love classifiers

For one thing, they're simple to understand. Humans easily process discrete decision units, and it's easy to model a decision problem as a classification problem. Should I walk or drive to work? Should I buy a red, blue, or yellow shirt? Should I hire this person or not?

Another reason: the math is simple. Accuracy is easy to understand (8/10 correct). Precision, recall, and F1 scores are a bit more complex, but foundational for data scientists.
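
As a minimal sketch of that arithmetic, the snippet below computes these metrics from confusion-matrix counts for a binary classifier. The function name `binary_metrics` and the example counts are illustrative assumptions, not taken from any particular library.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 10 predictions, 8 of them correct (5 true positives, 3 true negatives).
metrics = binary_metrics(tp=5, fp=1, fn=1, tn=3)
print(metrics)
```

Accuracy alone hides which kind of error the classifier makes, which is why precision and recall are usually reported alongside it.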

Lastly, they're flexible. The various [types of classifiers](/primitives/classifiers/types-of-classifiers) extend the model to cover nearly any type of real-world decision. We'll cover a few of the canonical ones in this guide.

## Why humans hate classifiers

Classification is often the first time someone has to swallow a hard truth about machine learning: the fact that classifiers are not always correct is a feature, not a bug.

If you're making some kind of decision or prediction that can be 100% correct by design, you should be building a [deterministic filter](/patterns/consistency/determinism), not a predictive classifier. If there is no deterministic filter you can use, yet you find yourself with a 100% correct classifier on some sample data, you can reasonably expect it to be overfit to that data and to generalize poorly out of sample.

Accepting that a system will not work 100% of the time is the mental leap many software engineers have to make when moving into AI engineering. So the goal has to shift: instead of looking for 100% reliability, can you reset expectations toward producing a model that is roughly as good as, or better than, a [human expert](/patterns/context/expert-annotation) at the same task?

## Where "AI" meets classification

As we mentioned, classifiers far predate modern notions of AI. But what shifted is how they're built, and where time is best spent optimizing their performance.

The rise of foundation models largely changed the way new classification systems get built. In the last generation of machine learning, the goal was to curate large corpuses of training data to create a model from scratch. The architectures were well known, but most of the work revolved around curating the training data itself.

But foundation models are already pre-trained, meaning *someone else* did the work of curating data for you. Some of that data will be relevant to the task at hand, some not. The higher inference cost reflects the fact that every call draws on the whole pile of useful and not-so-useful data at once. Regardless, the task shifts from "pre-training" the model to "post-training" it - effectively a *steering* exercise.

This means, depending on task complexity, that you only need a small amount of data to steer the model towards the behavior you want. And the good news: this can almost always be done in-context, meaning you don't really need to update the weights of the foundation model itself, but rather just tell it how you want it to behave via prompting or other in-context learning methods we'll discuss in this guide.
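
To make in-context steering concrete, here is a minimal sketch that assembles a few-shot classification prompt from a handful of labeled examples. The function name `build_classifier_prompt`, the prompt layout, and the example tickets are all hypothetical; the resulting string would be sent to whichever model you use, which is not shown here.

```python
def build_classifier_prompt(
    task: str,
    labels: list[str],
    examples: list[tuple[str, str]],
    text: str,
) -> str:
    """Assemble a few-shot classification prompt from a few labeled examples."""
    lines = [task, f"Allowed labels: {', '.join(labels)}", ""]
    for example_text, example_label in examples:
        # Guard against steering the model with a label outside the label set.
        if example_label not in labels:
            raise ValueError(f"Example label {example_label!r} is not in the label set")
        lines += [f"Input: {example_text}", f"Label: {example_label}", ""]
    # The final input is left unlabeled for the model to complete.
    lines += [f"Input: {text}", "Label:"]
    return "\n".join(lines)


prompt = build_classifier_prompt(
    task="Decide whether the support ticket should be escalated.",
    labels=["should_escalate", "should_not_escalate"],
    examples=[
        ("My account was charged twice and I need this fixed today.", "should_escalate"),
        ("How do I change my profile picture?", "should_not_escalate"),
    ],
    text="The app has been down for our whole team since this morning.",
)
print(prompt)
```

No weights change here: the labeled examples live entirely in the prompt, so updating the model's behavior means editing data rather than retraining.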

## When to (and not to) use foundation models for classification tasks

Some classification problems are inherently simple, and more or less objective. As we mentioned earlier in the guide, much of the goal of analytical AI is to compress a task into the smallest model possible that performs well on its [evals](/patterns/evals) and production [performance tradeoffs](/deployment/model-selection/performance-tradeoffs).

The core tradeoff is whether the task benefits enough from foundation model capability to justify paying for inference instead of spending time curating data for a smaller, narrower system:

| **Decision pressure** | **Foundation models are less justified when...** | **Foundation models are more justified when...** |
| --- | --- | --- |
| **Time to curate data** | Useful labels are easy to produce, the task is stable, and a small dataset can describe the decision boundary well. | Labels are expensive, sparse, or slow to collect, and the model's pre-training can substitute for a large task-specific corpus. |
| **Inference cost** | The task runs at *extremely* high volume, has tight latency constraints, or the decision is low-value from a business perspective. | The classification is high-value, or important enough that the marginal cost of inference is acceptable. |
| **Task complexity** | The label can be assigned from obvious features, deterministic filters, keyword rules, embeddings, or a small classifier. | The model must handle messy inputs, many edge cases, subjective criteria, or shifting definitions of what the label means. |
| **Need for autoregressive reasoning** | The task is pattern recognition, not reasoning; step-by-step generation adds cost and variance without improving the decision. | The model must interpret context, weigh ambiguous evidence, follow nuanced instructions, or explain why a label applies. |
| **Need for re-training** | Models will be "stable" and not frequently updated against new task understanding or data drift | The task definition is frequently changing or the data distribution continually shifts |

Generally speaking, more and more teams are reaching for foundation models instead of traditional ML for classification. As foundation models continue to improve and inference costs decrease, spending time curating data is increasingly seen as less worthwhile than simply paying higher inference costs (ML engineers and expert annotators are expensive!).

---

# Types of Classifiers

Common classifier patterns, including binary, multiclass, multilabel, hierarchical, open-set, and ordinal classifiers.

Source: https://sutro.sh/handbook/primitives/classifiers/types-of-classifiers

Updated: 2026-06-02

For the purposes of this guide, we won't cover every type of classifier - just the ones that are relevant to what we've seen in working with customers.

The first design choice is the shape of the output. Most classifier designs fall into a few common patterns:

| **Type** | **Output** | **Use when...** |
| --- | --- | --- |
| **Binary** | One of two labels | The decision is naturally yes/no, pass/fail, relevant/irrelevant, or otherwise reducible to two outcomes. |
| **Multiclass** | Exactly one label from a fixed set | The categories are mutually exclusive, and every input should map to one best answer. |
| **Multilabel** | Any subset of labels from a fixed set | Multiple labels can apply at once, such as user intents, content topics, risk flags, or product attributes. |
| **Hierarchical** | One or more labels across levels of a taxonomy | The decision has a natural parent-child structure, such as support category -> issue type -> root cause. |
| **Open-set** | A known label, `other`, or a proposed new label | The taxonomy is still evolving, or you expect meaningful inputs that do not fit the current label set. |
| **Ordinal / scoring** | An ordered label, score band, or Likert-style rating | The output has meaningful order, such as severity, quality, risk, urgency, or degree of fit. |

## Binary

The "hello world" of classification. Binary classifiers make one of two possible decisions. If the task can honestly be represented as a yes/no question, start here.

Use binary classifiers when the decision boundary is narrow and easy to reason about. Common label sets include:

- true / false
- pass / fail
- relevant / irrelevant
- should_escalate / should_not_escalate

Avoid binary classifiers when there is a meaningful third state. If a human expert would often say ["unclear", "not enough information", or "partially"](/primitives/classifiers/abstention), forcing the model into a binary label set will usually produce false certainty, or force the model into a decision when it's actually unsure.

## Multiclass

Multiclass classifiers choose exactly one label from a closed set of possible labels. Often, this is the next step up from a binary classifier, or a way to extend a binary decision to include a third class.

Use multiclass classifiers when the labels are mutually exclusive and each input should have one best answer. Some examples:

- true, false, unclear
- pass, fail, insufficient_evidence
- billing_issue, technical_issue, account_issue, product_feedback

Avoid multiclass classifiers when labels can reasonably overlap. For example, a support ticket can be both a billing issue and a cancellation risk. Forcing one label may make downstream routing simpler, but it can erase information the business actually needs.
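
One way to enforce "exactly one label from a closed set" in code is to parse the model's raw text into an enum and reject anything else. The sketch below uses the support-ticket labels from the example above; the names `TicketLabel` and `parse_label` are hypothetical.

```python
from enum import Enum


class TicketLabel(str, Enum):
    """Closed label set: the model must return exactly one of these values."""

    BILLING_ISSUE = "billing_issue"
    TECHNICAL_ISSUE = "technical_issue"
    ACCOUNT_ISSUE = "account_issue"
    PRODUCT_FEEDBACK = "product_feedback"


def parse_label(raw_output: str) -> TicketLabel:
    """Map raw model text to one member of the closed set, or raise ValueError."""
    cleaned = raw_output.strip().lower()
    try:
        return TicketLabel(cleaned)
    except ValueError:
        allowed = [label.value for label in TicketLabel]
        raise ValueError(f"{raw_output!r} is not one of {allowed}") from None


# Whitespace and casing differences are tolerated; unknown labels are not.
print(parse_label("  Billing_Issue \n").value)
```

Failing loudly on an out-of-set label is a design choice: it surfaces prompt or schema problems early instead of letting unexpected values flow into downstream routing.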

## Multilabel

Multilabel classifiers choose any subset of labels from a closed set. They are best when one categorical choice will not sufficiently express the decision boundary.

Use multilabel classifiers for [intent detection](/primitives/judges/types), topic tagging, risk tagging, product attributes, and other cases where multiple things can be true at once. An example label set might look like:

- user_asked_follow_up_question
- user_not_satisfied
- user_submitted_claim
- user_filed_case_report
- user_requested_human_support_agent

Avoid multilabel classifiers when the label set is really a hierarchy or when one label should logically exclude another. In those cases, a multilabel task can produce combinations that look valid syntactically but do not make sense operationally.

Multilabel classifiers can also be harder to score precisely, since labels do not occur independently of one another. For example, one label may be chosen more frequently when another label is also chosen. This means aggregate accuracy is often less useful than per-label precision, recall, and review of common co-occurrences.
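
A minimal sketch of per-label scoring and co-occurrence counting, assuming gold and predicted labels are stored as one set per record. The function names and the three example records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def per_label_scores(
    gold: list[set[str]], predicted: list[set[str]], labels: list[str]
) -> dict[str, dict[str, float]]:
    """Compute precision and recall separately for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        scores[label] = {
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        }
    return scores


def co_occurrences(predicted: list[set[str]]) -> Counter:
    """Count how often each pair of labels is predicted together."""
    counts: Counter = Counter()
    for record_labels in predicted:
        counts.update(combinations(sorted(record_labels), 2))
    return counts


labels = ["user_asked_follow_up_question", "user_not_satisfied", "user_requested_human_support_agent"]
gold = [
    {"user_not_satisfied", "user_requested_human_support_agent"},
    {"user_asked_follow_up_question"},
    {"user_not_satisfied"},
]
predicted = [
    {"user_not_satisfied"},
    {"user_asked_follow_up_question", "user_not_satisfied"},
    {"user_not_satisfied"},
]

scores = per_label_scores(gold, predicted, labels)
pairs_seen = co_occurrences(predicted)
print(scores)
print(pairs_seen)
```

Reading the per-label numbers next to the co-occurrence counts makes it easier to spot a label that is only ever predicted alongside another one.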

## Hierarchical

Hierarchical classifiers assign labels across levels of a taxonomy. For example, a support classifier might assign `billing -> refund -> duplicate_charge`, rather than choosing only one flat category.

Use hierarchical classifiers when the domain already has a meaningful parent-child structure. They are useful for support taxonomies, product taxonomies, policy categories, medical or legal issue trees, and other settings where broad categories break down into narrower subtypes.

There are two common implementation patterns:

- Predict every level at once, such as category, subcategory, and root cause.
- Predict stepwise, where the first classifier picks the parent category and later classifiers pick narrower children.
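
The stepwise pattern can be sketched as a walk down the taxonomy, where each step only offers the children of the label chosen so far. In this sketch, `choose` is a stand-in for a model call, `keyword_chooser` is a toy substitute so the example runs offline, and every taxonomy entry other than `billing -> refund -> duplicate_charge` is a hypothetical placeholder.

```python
from typing import Callable

# Nested dict taxonomy: each key is a label, each value holds its children.
TAXONOMY = {
    "billing": {"refund": {"duplicate_charge": {}, "late_refund": {}}, "invoice": {}},
    "technical": {"login": {}, "performance": {}},
}


def classify_stepwise(
    text: str, taxonomy: dict, choose: Callable[[str, list[str]], str]
) -> list[str]:
    """Pick one label per level, offering only the children of the chosen parent."""
    path: list[str] = []
    node = taxonomy
    while node:
        options = sorted(node)
        choice = choose(text, options)
        # Reject choices that are not valid children at this level.
        if choice not in node:
            raise ValueError(f"{choice!r} is not a valid child at level {len(path) + 1}")
        path.append(choice)
        node = node[choice]
    return path


def keyword_chooser(text: str, options: list[str]) -> str:
    """Toy stand-in for a model: first option whose words all appear in the text."""
    lowered = text.lower()
    for option in options:
        if all(word in lowered for word in option.split("_")):
            return option
    return options[0]


path = classify_stepwise(
    "Billing problem: I need a refund for a duplicate charge.", TAXONOMY, keyword_chooser
)
print(" -> ".join(path))
```

Because each step validates against the taxonomy, the classifier cannot emit a child under the wrong parent, a guarantee that predicting every level at once does not give you for free.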

## Open set

This is a special type of multiclass or (more typically) multilabel classifier that is not limited to a fixed (closed) set of labels for the model to choose from. Instead, the model can propose and apply a new label when no label in the existing set applies.

Open-set labeling requires delicate post-processing, and it can be particularly risky because it carries more model-induced bias than a closed-set task.

Use open-set classifiers when the taxonomy is still evolving, or when you are mining a dataset to discover latent clusters before committing to a closed label set.

Avoid open-set classifiers in production workflows that require stable reporting, routing, or enforcement. A model that can invent labels can also invent slightly different versions of the same label, encode its own biases into the taxonomy, or create categories that are hard for humans to interpret.

Open-set outputs usually need post-processing. At minimum, you should review proposed labels, merge duplicates, normalize wording, and decide which new labels deserve to become part of the closed set.
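
As a minimal sketch of the mechanical part of that post-processing, the snippet below normalizes wording and groups surface-level duplicates so a reviewer sees each candidate label once. The function names and example labels are hypothetical, and this only catches spelling and formatting variants; semantically equivalent labels still need human review.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Lowercase, trim, and collapse punctuation or whitespace into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", label.strip().lower()).strip("_")


def group_proposed_labels(proposed: list[str], known: set[str]) -> dict[str, list[str]]:
    """Group raw proposals by normalized form, skipping labels already in the closed set."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in proposed:
        key = normalize_label(raw)
        if key not in known:
            groups[key].append(raw)
    return dict(groups)


known_labels = {"billing_issue", "technical_issue"}
proposed_labels = ["Shipping Delay", "shipping-delay", "shipping_delay ", "Billing Issue", "Warranty claim"]

candidates = group_proposed_labels(proposed_labels, known_labels)
for normalized, variants in candidates.items():
    print(normalized, "<-", variants)
```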

The most common failure mode is treating model-suggested labels as ground truth. Open-set labeling is useful for discovery, but the taxonomy still needs human ownership.

For production use, open-set labels usually need [expert review](/patterns/context/expert-annotation) before becoming part of a stable taxonomy.

## Ordinal / scoring

Ordinal classifiers assign one label from an ordered set. The labels are discrete, but their order matters.

Use ordinal classifiers when the task is not just asking what kind of thing something is, but how much of some property it has.

Common output shapes include:

- low / medium / high risk
- poor / fair / good / excellent
- 1 / 2 / 3 / 4 / 5 
- not_urgent / somewhat_urgent / very_urgent

Avoid ordinal classifiers when the labels imply more precision than the model or rubric can support. A 1-5 (Likert) score is only useful if humans can reliably agree on what each number means.
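
One rough way to check whether humans agree on a scale is to have two annotators score the same items and compare. The sketch below reports exact agreement and agreement within one point; the function name and the scores are hypothetical, and more formal agreement statistics exist if you need them.

```python
def agreement_rates(scores_a: list[int], scores_b: list[int]) -> dict[str, float]:
    """Return exact and within-one-point agreement between two annotators."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Both annotators must score the same non-empty set of items")
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return {"exact": exact, "within_one": within_one}


# Hypothetical 1-5 scores from two annotators on the same five items.
annotator_a = [1, 3, 4, 5, 2]
annotator_b = [1, 4, 4, 3, 2]

rates = agreement_rates(annotator_a, annotator_b)
print(rates)
```

If exact agreement is low but within-one agreement is high, collapsing the scale into fewer bands may give a more reliable label set.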

---

# Abstention

For when "I don't know" is the best decision a model can make

Source: https://sutro.sh/handbook/primitives/classifiers/abstention

Updated: 2026-06-02

Abstention is the decision to not assign a normal task label. Instead of forcing the classifier to choose from the main label set, you give it a way to say that the input cannot be classified reliably.

This is not a weakness in the classifier design. In many production systems, abstention is what prevents the model from turning ambiguity into false certainty.

## When abstention belongs in the label set

Add an abstention label when a human expert would sometimes refuse to make the decision from the available evidence.

Common abstention labels include:

- unclear
- insufficient_evidence
- not_applicable
- out_of_scope
- needs_human_review

The right label depends on what the downstream system needs to know. `unclear` means the case may be in scope, but the evidence is ambiguous. `out_of_scope` means the input does not belong in the task at all. `needs_human_review` means the model should not be the final decision-maker, even if it has a guess.

## What abstention is for

Abstention is useful when the cost of a wrong label is higher than the cost of deferring the decision.

Good use cases include:

- policy or compliance decisions where false certainty creates risk
- routing tasks where the wrong destination is expensive
- extraction or classification tasks with [incomplete input data](/patterns/context)
- [eval judges](/primitives/judges/types) where the result cannot be determined from the artifact being judged
- classifiers that operate on messy user-generated text, logs, tickets, or conversations

In these cases, the goal is not to maximize the number of classified examples. The goal is to maximize the number of correct, useful decisions the system can make without pretending to know more than it does.

## What abstention is not for

Abstention should not become a junk drawer for every hard example. If too many cases fall into the abstention label, the task may be underspecified, the labels may overlap, or the model may need more context.

Avoid abstention when the downstream workflow truly requires a best-effort guess. For example, if a routing system must always pick the most likely team and can cheaply recover from mistakes, a ranked or fallback route may be more useful than an abstain label.

The most common failure mode is defining abstention too vaguely. If the model is told to use `unclear` whenever it is "not sure", the label will behave inconsistently. The instructions should explain what evidence is missing, what ambiguity matters, and when the model should still choose the best available label.

## Designing abstention criteria

Treat abstention like any other label: define when it applies, when it does not apply, and what should happen next.

| **Abstention label** | **Use when...** | **Do not use when...** |
| --- | --- | --- |
| `unclear` | The input is in scope, but the evidence supports multiple plausible labels. | The model can choose a label using the provided rubric. |
| `insufficient_evidence` | The input is missing information required to make the decision. | The information is present but difficult to interpret. |
| `not_applicable` | The input does not contain the kind of object the classifier is designed to classify. | The input is relevant but belongs to an uncommon class. |
| `out_of_scope` | The input belongs to a different task or domain entirely. | The input is in domain but ambiguous. |
| `needs_human_review` | A wrong automated decision would be costly enough that a [human should decide](/patterns/context/expert-annotation). | The model is merely uncertain in a low-stakes setting. |

One useful test: if the model abstains, a human should be able to understand why and know what to do next. If the abstention label does not trigger a clear downstream action, it may be too vague to be useful.
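
That test can be encoded directly: every abstention label maps to an explicit downstream action, and anything without a mapping is rejected. The action names below are hypothetical placeholders for whatever your workflow actually does, and `abstention_rate` is a simple monitor for abstention being overused.

```python
# Hypothetical mapping from abstention label to downstream action.
ABSTENTION_ACTIONS = {
    "unclear": "send_to_review_queue",
    "insufficient_evidence": "request_more_information",
    "not_applicable": "skip_record",
    "out_of_scope": "forward_to_other_workflow",
    "needs_human_review": "send_to_review_queue",
}


def next_action(label: str, task_labels: set[str]) -> str:
    """Return the downstream action for a task label or an abstention label."""
    if label in task_labels:
        return "apply_label"
    if label in ABSTENTION_ACTIONS:
        return ABSTENTION_ACTIONS[label]
    raise ValueError(f"Unknown label: {label!r}")


def abstention_rate(labels: list[str]) -> float:
    """Share of outputs that abstained; a high value may mean the task is underspecified."""
    if not labels:
        return 0.0
    return sum(label in ABSTENTION_ACTIONS for label in labels) / len(labels)


task_labels = {"pass", "fail"}
outputs = ["pass", "unclear", "fail", "insufficient_evidence"]

for label in outputs:
    print(label, "->", next_action(label, task_labels))
print("abstention rate:", abstention_rate(outputs))
```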

---

# Are judges classifiers?

Why LLM judges are a special case of classifiers and when they should be treated as a distinct analytical AI primitive.

Source: https://sutro.sh/handbook/primitives/classifiers/are-judges-classifiers

Updated: 2026-06-02

We split out [judges](/primitives/judges) and classifiers into two different primitives. Why?

In many ways, a judge *is* just a classifier (typically [multiclass](/primitives/classifiers/types-of-classifiers)), but it's a special case.

- They typically operate over model outputs as input data
- Their primary purpose is to provide verifiable measurements of otherwise unverifiable model outputs
- They specifically try to follow the [judgement rubric](/primitives/judges/task-design) of a human expert, which requires autoregressive reasoning capabilities (so they can't really be built as traditional ML classifiers)
- They may be composed of multiple classifiers (forming multi-dimensional rubrics), rather than producing a single output field

Judge design, purpose, and application areas often differ from other types of AI classifiers. While not fully dissimilar, we have broken them out into two distinct primitives for the purposes of this guide.

---

# Extractors

Pull relevant fields and spans from unstructured documents

Source: https://sutro.sh/handbook/primitives/extractors

Updated: 2026-06-02

Structured extraction is one of the most important unlocks offered by foundation models. This is especially the case now that modern LLMs can reliably produce JSON-formatted outputs, which was not the case until recently.

## Extraction is a decision problem...

But if models can reliably produce JSON outputs, what's left to solve? Why even discuss extraction models in the context of analytical work?

Well - if you've ever tried to use LLMs for structured extraction, you may have noticed that it's often more of a [decision problem](/primitives/classifiers) for the model than a pure information retrieval problem.

Let's use an example. Let's say we want to extract the company name from the following article title.

> Apple to partner with OpenAI on new ChatGPT integration, per Bloomberg.

Not so simple: the title names three organizations (Apple, OpenAI, and Bloomberg). We'll probably need to [be more specific](/patterns/consistency/task-specificity): "Extract the company name from the article. If there are multiple companies listed, exclude any companies that are not the primary subject. If there are still multiple companies, choose the one that is the grammatical subject of the title."

Now consider this article title:

> Google, Apple, and Nvidia partner on new self-driving initiative.

Now we've found a case that breaks our latest policy, and we need to refine again. We could go on and on with these examples, but the main point is that extraction keeps turning back into a decision problem.

## ...and a parsing problem

There are many companies that handle PDF parsing and similar problems, which primarily arise upstream of the decision problem. In the examples we used, we feed simple plain text into the LLM, yet the extraction problem is still ambiguous. [Good extractor design](/primitives/extractors/good-extractor-design) is mostly about constraining this ambiguity.

For the purposes of this guide, we assume that the input data can at least be understood by the decision model you're creating. We won't really cover parsing infrastructure in this guide, but may be able to recommend vendors or models that handle parsing reliably.

---

# Extraction Verification

Ground-truthing extractors that contain free-form text

Source: https://sutro.sh/handbook/primitives/extractors/extraction-verification

Updated: 2026-06-02

A common theme in this guide is that we prefer creating models with verifiable "correctness", or at least an [expert-annotated dataset](/patterns/context/expert-annotation/what-good-annotations-capture). This lets us establish a ground truth and gives us a scoring rubric to improve against.

For closed-set fields, verification often looks more like [classifier evaluation](/primitives/classifiers/types-of-classifiers).

## Free-form text similarity

Extractors can be trickier in this regard, because they're typically allowed to generate free-form text by design. To handle this, we recommend using embeddings, which are cheap to run and widely available as open-source models. The [sentence-transformers](https://github.com/huggingface/sentence-transformers) GitHub package is typically sufficient for such purposes, especially when text spans are short.

Using embeddings for similarity scoring isn't perfect, but cosine similarity or a similar scoring method gives a single number per comparison. Cosine similarity ranges from -1 to 1, and scores for natural-language text in practice mostly land between 0 and 1, which should be sufficient for most measurement purposes.
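
For reference, here is a minimal, dependency-free sketch of the cosine similarity calculation itself. The short vectors are toy stand-ins for real embeddings; in practice the vectors for the extracted span and the ground-truth span would come from an embedding model, which is not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of an extracted span and its ground truth.
extracted_vector = [0.2, 0.7, 0.1]
ground_truth_vector = [0.25, 0.65, 0.05]

score = cosine_similarity(extracted_vector, ground_truth_vector)
print(round(score, 3))
```

A score threshold for counting an extraction as "correct" is task-specific and is best calibrated against a sample of expert-annotated pairs.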

---

# Good Extractor Design

Unfortunately, we're not talking about Inception

Source: https://sutro.sh/handbook/primitives/extractors/good-extractor-design

Updated: 2026-06-02

As with most topics in this handbook, you should err on the side of simplicity, [atomicity, and task decomposition](/primitives/judges/task-design) when building extractors. As with other decision models, you'll want each extractor to stay focused on one decision path at a time if budget and architecture allow.

We'll break down what field types should be allowed, how you should try to scope them, and what a task should look like in general.

| **Property** | **Recommendation** | **Rationale** |
| --- | --- | --- |
| **Scope** | When possible, defer to atomicity. That means fewer fields to extract, and less overall decision-making by the model. | Overloading models with too many tasks at once yields inconsistent results, and often forces the need for a larger model. Splitting an extraction task up may allow for overall greater throughput and accuracy. |
| **Schema** | Prefer enums, and [closed-set fields](/primitives/classifiers/types-of-classifiers) when possible. If using free-form text, prefer shorter rather than longer spans (1-2 sentences max if possible). | Using closed-set field types simplifies verification. |
| **Task orientation** | Prefer "contained within" tasks that locate text and data already within the input. If you are making an inference about something contained, a classification task may make more sense. If you're summarizing or abstracting information from an input, a structured extraction task is not really the correct primitive to be using. | Again, using contained-within data is typically easier for [verification](/primitives/extractors/extraction-verification). Abstractive summaries are harder to verify for correctness. |
| **Missingness** | Define when the model should return [`null`, `unknown`, or `not_applicable`](/primitives/classifiers/abstention), and do not force a value when the evidence is absent. | Many extraction failures are hallucinated values caused by schemas that require an answer even when the input does not contain one. |
| **Use of reasoning** | If the task is decision-oriented, having the model emit a structured scratchpad of its decision rationale can improve accuracy and auditability. | As mentioned in the intro, extractors are often decision problems. A scratchpad gives the model something to self-reference and gives you, as a developer, a window into potential failure modes. |
| **Support gathering** | It can be helpful to force a model to cite where it found information. If documents are long, this can help with verification of task accuracy. | Again, improvement of verification and auditability. |
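
Several of these recommendations can be combined in one small schema. The sketch below is one possible shape, not a prescribed one: it allows a null value for missingness, asks for a supporting quote, and checks that both the value and the quote are actually contained in the input. The names `CompanyExtraction` and `verify_extraction` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompanyExtraction:
    rationale: str                # scratchpad explaining the decision
    company_name: Optional[str]   # None when the input names no primary company
    evidence: Optional[str]       # verbatim span from the input supporting the value


def verify_extraction(source_text: str, result: CompanyExtraction) -> list[str]:
    """Return a list of problems; an empty list means the extraction is contained in the input."""
    problems = []
    if result.company_name is None:
        # A null value should not come with supporting evidence.
        if result.evidence is not None:
            problems.append("evidence given for a null value")
        return problems
    if result.evidence is None or result.evidence not in source_text:
        problems.append("evidence is not a verbatim span of the input")
    if result.company_name not in source_text:
        problems.append("company_name does not appear in the input")
    return problems


title = "Apple to partner with OpenAI on new ChatGPT integration, per Bloomberg."
grounded = CompanyExtraction(
    rationale="Apple is the subject of the title; Bloomberg is only the news source.",
    company_name="Apple",
    evidence="Apple to partner with OpenAI",
)
hallucinated = CompanyExtraction(
    rationale="Guessed from general knowledge.",
    company_name="Microsoft",
    evidence="Microsoft announced",
)

print(verify_extraction(title, grounded))
print(verify_extraction(title, hallucinated))
```

Exact substring checks are strict on purpose; they only work for "contained within" tasks, which is part of why those tasks are easier to verify.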

---

# Judges

A core unit of analytical AI to scale the judgement of a domain expert.

Source: https://sutro.sh/handbook/primitives/judges

Updated: 2026-05-19

If you are building in AI, you have likely come across a lot of content and products around [evals](/patterns/evals). However, there is not much useful information around [LLM-as-a-judge](/primitives/judges/terminology), which underpins many, if not most, modern evals.

But LLM judges are not just for evals. AI models that make judgment calls, or decisions that would otherwise be handled by a human, unlock a massive amount of utility because they can be run at a cost and scale otherwise impossible to match with real people. But judges are only useful when they can make decisions as good as or better than the humans they are proxying.

This section will cover what judges are, where they're often used, and some principles around [task design](/primitives/judges/task-design) to use them effectively.

---

# Judge Terminology

The core vocabulary used when discussing LLM judges and candidate models.

Source: https://sutro.sh/handbook/primitives/judges/terminology

Updated: 2026-05-19

**LLM-as-a-judge = LLM judge = AI judge = using a non-deterministic, pre-trained model as a proxy for human judgment.** We will use these terms somewhat interchangeably in this handbook.

[Evals](/patterns/evals/approaches) != LLM-as-a-judge. There are a number of methods and tools to evaluate an AI system, and LLM judges are just one of them.

We will use the term **[candidate model](/patterns/evals/static-evals-vs-judges)** to make it clear when we are referring to a model or system that is being evaluated, and to disambiguate it from the judge.

---

# What's in a Judge?

The model, context, input, and output schema that make up a typical LLM judge.

Source: https://sutro.sh/handbook/primitives/judges/anatomy

Updated: 2026-05-19

We recommend building judges composed of the following components and properties:

| Component | Description | Example | Guidance |
| --- | --- | --- | --- |
| Model | A strong, instruction-tuned LLM. | GPT-5.4-mini, Gemma-4-31B. | Do not overthink the [choice of model](/deployment/model-selection). Most modern LLMs are strong instruction-followers, so any foundation model of sufficient size (we recommend at least 30B total parameters as of this writing) should be able to handle a well-defined judge task. Choose something within the latency and cost budget your application requires. |
| Context | Typically a strong system prompt, with no fine-tuning. | "You are evaluating the outputs of another AI model. Your job is to determine if it helped the customer return their order successfully. Evaluate based on three components..." | We recommend against manual prompt engineering to build judges. Use [human annotations](/patterns/context/expert-annotation) and an automated prompt optimization tool to automatically build a strong system prompt for the judge model you have selected. |
| Input | If used for evals, typically a single user conversation with the model, including inputs and outputs, or an agent trace. If used for other purposes, typically one record of the unstructured or semi-structured data being analyzed. | User: "Can you help me return order ABC12345?" Model: "I would be happy to help. Can you provide confirmation of delivery and the address it was delivered to?" | Make sure to provide all necessary information to a judge, and do not hide evidence that would be useful in making a decision. You can optionally supplement a judge with web search or other external grounding tools, but these can be hard to audit and highly variable in pulling in necessary information. |
| Output Schema | A decision label, ideally binary or ternary, and a rationale. | `{"rationale": "The model asked for all three required components to assist the user with their return.", "label": "pass"}` | Provide an output schema with rationale first, then label second. Frame the task as a single-label classification problem with as few options as possible. [Binary or ternary label sets](/primitives/classifiers/types-of-classifiers) are ideal. Avoid numerical scores when possible; if needed, use a 1-5 Likert scale. Do not ask the model for a [confidence score](/patterns/consistency/confidence-scores). |
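
As a minimal sketch of the output-schema guidance above, the following Python parses a judge response and rejects anything that does not put the rationale first and the label second. The `JudgeVerdict` class, the `parse_verdict` function, and the `pass`/`fail` label set are illustrative assumptions, not part of any Sutro API.

```python
import json
from dataclasses import dataclass

# Assumed binary label set for this example; swap in your own labels.
ALLOWED_LABELS = frozenset({"pass", "fail"})


@dataclass(frozen=True)
class JudgeVerdict:
    rationale: str
    label: str


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse a judge's JSON output and enforce rationale-then-label ordering."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    # json.loads preserves key order, so we can check that the model
    # wrote its rationale before committing to a label.
    if list(data.keys()) != ["rationale", "label"]:
        raise ValueError("expected exactly the keys 'rationale' then 'label'")
    rationale, label = data["rationale"], data["label"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return JudgeVerdict(rationale=rationale, label=label)


verdict = parse_verdict(
    '{"rationale": "The model asked for all three required components.", "label": "pass"}'
)
print(verdict.label)
```

Rejecting malformed outputs at parse time keeps schema violations visible as their own failure mode instead of letting them silently count as a `fail`.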

---

# Types of Judges

Reliability, quality, sentiment, and intent judges in the judge-design hierarchy.

Source: https://sutro.sh/handbook/primitives/judges/types

Updated: 2026-05-19

## Reliability Judges

This is where most teams do, and should, start. These judges ask basic and critical questions:

1. Did my AI agent or product actually do what it was supposed to do for the user? (pass/fail)
2. Did it break anywhere in its execution? (pass/fail/unknown)
3. Does it require [escalation to a human](/primitives/classifiers/abstention)? (pass/fail)

These are core reliability questions. If your [candidate](/primitives/judges/terminology) is consistently failing against these checks, you have serious upstream problems to address.

They are absolutely critical to get right. Being unsure about the performance of these types of judges can result in an entirely broken product experience that users will likely churn from. Or worse, if you're in a compliance-heavy domain, it could mean legal trouble.

## Quality and User Sentiment Judges

The next level up the ladder is quality judges. This is how you turn a working AI product into a good product. These ask questions like:

1. Can you rate the helpfulness of the agent: not at all helpful, somewhat helpful, or very helpful? (not_helpful/somewhat_helpful/very_helpful)
2. Was the user satisfied with the agent's response? (satisfied/not_satisfied/unclear)
3. Was the information clear, concise, and detailed? (true/false)

## Intent Judges

This is where we see AI teams making the leap from good to great. These judges are used to understand not just the quality of an AI product, but what users are actually trying to do with their product. This tells teams where they should double down and where they should cut scope.

This will mostly look like a binning exercise, not a single-choice outcome. For this, we recommend a [multi-label classifier](/primitives/classifiers/types-of-classifiers) that is permitted to select multiple options, ranging from tens to hundreds of observed user actions.

You might be wondering how to gather these user action class labels. There are several options, but one of the more appealing options is [open-set labeling](/primitives/classifiers/types-of-classifiers), where the model is permitted to suggest new labels itself. Open-set labeling is worth a short guide of its own, so we will leave it alone for now and assume you have a means of gathering intent labels - likely by actually reading over some data and getting a sense of appropriate bins.

When building intent judges, ask questions like:

1. What did this user try to accomplish through my agent or product? (process_a_return, ask_product_support_question, redeem_a_voucher, etc.)
2. What did the user want to accomplish that the agent was unable to fully complete? (checkout_from_cart, process_credit_card, remove_saved_item, etc.)
3. What product drawbacks can we infer from this support interaction? (unclear_instructions, no_id_verification, improperly_handled_case, etc.)
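
One way to handle the output of a multi-label intent judge is to separate labels already in your taxonomy from labels the model proposed on its own, which is the raw material for open-set labeling. The sketch below is illustrative only: `KNOWN_INTENTS` and `split_intents` are hypothetical names, and the label set is borrowed from the examples above.

```python
from typing import Iterable, List, Tuple

# Hypothetical intent taxonomy, seeded from the examples in this section.
KNOWN_INTENTS = frozenset(
    {"process_a_return", "ask_product_support_question", "redeem_a_voucher"}
)


def split_intents(
    predicted: Iterable[str], known: frozenset = KNOWN_INTENTS
) -> Tuple[List[str], List[str]]:
    """Split a multi-label prediction into known labels and newly proposed ones."""
    accepted: List[str] = []
    proposed: List[str] = []
    # dict.fromkeys de-duplicates while preserving the model's label order.
    for label in dict.fromkeys(predicted):
        if label in known:
            accepted.append(label)
        else:
            proposed.append(label)
    return accepted, proposed


accepted, proposed = split_intents(
    ["process_a_return", "track_a_package", "process_a_return"]
)
print(accepted, proposed)
```

Proposed labels can then be queued for human review before they are promoted into the known set.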

---

# Judges in Evals: Flip Your Intuition

First-principles responses to common objections about using LLMs to judge LLMs.

Source: https://sutro.sh/handbook/primitives/judges/intuition

Updated: 2026-05-19

If you are like most developers, your first instinct may be to reject the idea of using [non-deterministic](/patterns/consistency/determinism) approaches in settings where reliability counts. This is especially true in AI reliability itself: using a model to judge the results of another model feels like fighting fire with fire.

This typically comes from a handful of credible doubts. Let us combat these concerns from first principles.

{{sutro-callout title="Turn expert judgment into production-grade AI evals." body="Sutro provides infrastructure for expert annotation, optimization, and measurement." cta="See how Sutro works" href="https://sutro.sh/"}}

| Doubt | Rebuttal |
| --- | --- |
| **Intelligence:** there is no good reason to believe another LLM should be smarter or more capable than the model it is evaluating. | The model you are using as a judge is not inherently smarter than the model used to generate the results it is evaluating. But it does not need to be, because a [well-designed judge](/primitives/judges/task-design) is evaluating something much narrower than the task being evaluated. |
| **Subjectivity:** you are asking the judge to perform a subjective analysis on something that would otherwise be decided by the expert opinions of your team. | You can ground LLM judges in [expert judgment](/patterns/context/expert-annotation). We will make the bold claim that Sutro offers the best way to do this. |
| **Coverage:** there is an infinite range of possible inputs to the candidate model, so it is impossible to test against all possible scenarios. | Continuous distributions are part of the nature of building in AI. Even in a world of true AGI, mistakes and edge cases will be abundant. Discrete assertions feel safer, but building AI systems means working in a probabilistic domain. |
| **Non-determinism:** results may not be consistent. The exact same input could result in a different judgment, and mild variations of the same input are even more likely to produce this effect. | We can approximate [consistency](/patterns/consistency) through several inference strategies, and use inconsistency as a tool to understand where we need more coverage. |
| **Measurement:** if the judge is another AI model, how can we measure *its* accuracy? | You can design your judge to be **verifiable** against a corpus of [expert annotations](/patterns/context/expert-annotation). You can independently measure, calibrate, and optimize the judge's performance against this corpus of annotations using *general rules* so it can be trusted on data it has never seen. |

---

# Good Task Design Is All You Need

The design knobs that make LLM judges more reliable, measurable, and useful.

Source: https://sutro.sh/handbook/primitives/judges/task-design

Updated: 2026-05-19

The [last section](/primitives/judges/intuition) probably reminded you of how much control you yield when building AI systems. But as a good engineer, your job is to design systems around what can be controlled and mitigate the effects of known unknowns.

Fortunately, you have a lot of control over judge design decisions. At Sutro, we refer to this as **task design**. Many of these principles can be reused across the rest of the primitives.

Your available knobs are:

| Knob | Question | Example |
| --- | --- | --- |
| Atomicity | Can the task be decomposed, allowing many judges to evaluate small components of the result rather than one judge evaluating the entire result? | Instead of "Is this result good?", ask "Did the response include a URL for the user if one was requested?" and "Did the model produce clear user instructions?" |
| Structure | Can the task be reduced to a [binary or three-class outcome](/primitives/classifiers/types-of-classifiers)? | `pass` / `fail`; `true` / `false` / `insufficient_evidence`. |
| Specificity | Can each task be defined extremely clearly, such that any smart human or generally strong instruction-following model knows how to complete it? | Instead of "Did the model produce clear user instructions?", define what a clear instruction requires, in order of importance: input language, grammatical clarity, correct understanding of the problem, and so on. |
| Generalization | Instead of defining the task solely using few-shot examples, can it be represented through an abstract constitution of rules that will generalize to examples the judge has not seen? | Instead of only providing examples, define a rule: "When users try to directly provide financial data, the model should refuse to accept it." |
| Measurement | Can you get real, numerical inter-rater reliability metrics between your judge and human experts, including in-sample training data and held-out set performance, all on real human-labeled data? | Human/judge agreement: 85%. Held-out human/judge agreement: 83%. |
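
The Measurement knob can be made concrete with two standard inter-rater metrics: raw agreement and Cohen's kappa, which corrects agreement for chance. The sketch below assumes you have parallel lists of human and judge labels for the same cases; the function names and the toy data are illustrative.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of cases where the judge's label matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Expected chance agreement from each rater's marginal label frequencies.
    p_e = sum(human_counts[l] * judge_counts[l] for l in human_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
print(agreement_rate(human, judge), cohens_kappa(human, judge))
```

Computing both metrics separately on the cases used for prompt optimization and on a held-out set gives you the in-sample and held-out figures described in the table.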

The overall goal is task decomposition, model steering, and a general learning approach. Part of the benefit of using pre-trained models is that we can rely on what they already know and fill in last-mile learning gaps rather than starting from scratch.

At Sutro, we use a statistical learning approach that presents [ambiguous cases](/patterns/context/expert-annotation/which-cases-to-annotate) for labeling and steering, and high-confidence cases for auditing. Users provide feedback, and we use automated prompt optimization tooling to abstract strong, general decision rules into a system prompt.

You need to do the work of coming up with a good task design. Sutro provides the infrastructure for [annotation](/patterns/context/expert-annotation), optimization, and [measurement](/patterns/evals).

---

# Patterns

Best-practices and battle-tested strategies for analytical AI.

Source: https://sutro.sh/handbook/patterns

Updated: 2026-05-18

Architecture pages describe how Sutro systems fit together.

Use this section for product surface area, runtime boundaries, integration contracts, data movement, and the reasoning behind major technical choices.

---

# Consistency

Boring as a feature

Source: https://sutro.sh/handbook/patterns/consistency

Updated: 2026-06-02

## Slaying the dragon

In many ways, *consistency* is the hallmark of analytical AI. We hear it all the time from teams we work with: "Why can't I get my model to *consistently* behave like I want it to?"

You don't hear this as much from teams who are specifically looking for creative, emergent behavior from models. In the world of analytical AI, your goal is to manufacture **boring outcomes**: a model that consistently does one thing well. That way you can spend your time on everything that isn't so boring.

The main levers are [task specificity](/patterns/consistency/task-specificity), [parallel sampling](/patterns/consistency/parallel-sampling), and better [confidence signals](/patterns/consistency/confidence-scores).

---

# Don't be fooled by determinism

Why absolute determinism is less useful than measured consistency for real-world analytical AI systems.

Source: https://sutro.sh/handbook/patterns/consistency/determinism

Updated: 2026-06-02

## Don't be fooled by determinism

A lot of research goes into getting models to behave completely deterministically, all the way down to hardware cycles on GPU. This may be useful in academic and low-level research settings where experimental reproducibility is a necessity. But in applied AI, absolute determinism from models isn't very useful. In fact, it's probably damaging to overall result quality. Why?

It's not useful because foundation models receive [unstructured data from an unbounded, infinite range](/patterns/evals/static-evals-vs-judges). Even if you're able to get what appears to be absolutely consistent behavior from a model on a certain input, it's possible that some extremely subtle variation of that input - even so much as an extra comma, misspelling, or word rearrangement - can result in a different output. This is the practicality of real-world data. It's messy, and will constantly surprise you in new ways.

It can be damaging because *some* creativity can help with reasoning. Teams often set model [temperature](/patterns/consistency/temperature) to 0 in the hopes of increasing consistency, but they're somewhat hamstringing a model's ability to think about the task. It's like putting handcuffs on someone trying to complete a jigsaw puzzle. Maybe they can get it done, but they'll be pretty limited in their available range.

---

# Task Specificity

How specific task instructions, examples, and edge-case guidance make analytical AI systems more consistent.

Source: https://sutro.sh/handbook/patterns/consistency/task-specificity

Updated: 2026-06-02

## Task Specificity

{{sutro-callout title="Build consistent models with Sutro" body="Sutro provides annotation and automated prompt optimization infrastructure to create task-specific prompts that induce consistent behavior in production." cta="Learn how Sutro works" href="https://sutro.sh/"}}

The cheapest way to induce consistency is to be extremely specific in your prompting logic. This means defining the operational procedure the model should follow to complete the task, [few-shot examples of what good and bad looks like](/patterns/context/expert-annotation/what-good-annotations-capture), enumerating corner and edge cases, and generally abstracting good rules-of-thumb for problem solving on that task.

Think about it this way: if you were to drop a human with PhD-level intelligence into your company and ask them to carry out a repeated task that you already knew how to solve, how would you teach it to them? I suspect you'd write a specific guide - perhaps a 10-page PDF explaining how to perform that task, where it can get tricky, what issues you've come across in the past, and other useful context. Would you sit with them for hours, days, or weeks, teaching them what good and bad looks like? Probably not. Models these days are generally very strong instruction-followers, so just getting specific upfront is typically sufficient.

[Automated prompt optimization](/patterns/context/expert-annotation) can also help create super specific prompts from annotated datasets. We'll discuss that in another section of this guide.

---

# Fine-tuning and RL

When to consider fine-tuning or reinforcement learning for analytical AI tasks, and why prompt optimization should usually come first.

Source: https://sutro.sh/handbook/patterns/consistency/fine-tuning-and-rl

Updated: 2026-06-02

## Fine-tuning and RL

As we've mentioned above, and will mention elsewhere: don't reach for this first for most analytical AI use cases. For one thing, it's typically not necessary, and gathering the [data that's needed](/patterns/context/expert-annotation/what-good-annotations-capture) is a waste of time and money. For another, it locks model behavior into weights, which can make further adaptation more difficult. Push [automated prompt optimization](/patterns/consistency/task-specificity) as far as possible before deciding a task needs fine-tuning.

---

# Parallel Sampling

How parallel model samples and majority voting can improve consistency for repeated analytical AI decisions.

Source: https://sutro.sh/handbook/patterns/consistency/parallel-sampling

Updated: 2026-06-02

## Parallel Sampling

Perhaps our favorite trick for increasing consistency is the use of parallel sampling. This is simply setting a model's "n" sampling parameter to >1.

For example, setting n=10 on a [classification task](/primitives/classifiers) and taking the majority vote of the results can sharply reduce the odds that a single random "bad" inference determines the outcome.

Parallel sampling adds inference cost, but the cost typically grows less than linearly with n, because the input tokens can be cached and reused across samples. Open-source model providers and inference engines typically expose this parameter via API.
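
The voting step itself is small. As a sketch, assuming you have already collected the n sampled labels from your inference provider (the `majority_vote` name and the sample data are illustrative):

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of samples that chose it."""
    if not samples:
        raise ValueError("need at least one sample to vote")
    counts = Counter(samples)
    # On a tie, most_common returns the label that appeared first in `samples`.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)


# Ten hypothetical samples for one input (n=10).
samples = ["pass"] * 7 + ["fail"] * 3
label, agreement = majority_vote(samples)
print(label, agreement)
```

The agreement fraction is worth keeping alongside the label: a 10/10 vote and a 6/4 vote produce the same label but are very different signals.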

For multi-model voting, see [routers and ensembles](/deployment/model-selection/routers-and-ensembles).

---

# Confidence Scores

How to use confidence signals, agreement checks, and escalation logic without relying on self-reported model confidence.

Source: https://sutro.sh/handbook/patterns/consistency/confidence-scores

Updated: 2026-06-02

## Confidence Scores

AI teams often ask models to report confidence scores alongside a prediction. This is usually symptomatic of a concern about consistency or reliability.

If you find yourself doing this, it probably means you want to reach for other strategies first to increase consistency. However, reporting out *calibrated* confidence scores can still be extremely useful, especially when used as an escalation measure or to [queue for annotation](/patterns/context/expert-annotation/which-cases-to-annotate).

## Better Sources of Confidence

More useful confidence signals often come from the system around the model:

- **Agreement:** do [parallel samples](/patterns/consistency/parallel-sampling) converge on the same answer?
- **Evidence:** did the model find the facts required to support the answer?
- **Verifier checks:** does a separate [judge](/primitives/judges), rule, or retrieval check confirm the output?
- **Logprobs:** you can *sometimes* rely on cumulative logprobs to measure a model's confidence in its result. We won't go into detail here, because it's only situationally useful and can conflict with other best-practices we recommend.

These signals are not perfect either, but they are significantly better than self-reported confidence. A model doesn't always know when it doesn't know - especially when eager to help (just like us humans).

## What Confidence Is For

Confidence should usually drive routing of results. Use it to decide whether the system should accept an answer, [abstain](/primitives/classifiers/abstention), retry, escalate to a human, or gather more context.

For analytical AI, a confidence score is most valuable when it changes what the system does next.
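
As a rough sketch of confidence-driven routing, the function below maps two of the signals above (sample agreement and whether supporting evidence was found) to a next action. The thresholds, the action names, and the `route` function are all assumptions for illustration; tune them against your own annotated data.

```python
def route(
    agreement: float,
    evidence_found: bool,
    accept_at: float = 0.8,  # assumed threshold, not a recommendation
    retry_at: float = 0.5,   # assumed threshold, not a recommendation
) -> str:
    """Pick the system's next action from external confidence signals."""
    if not 0.0 <= agreement <= 1.0:
        raise ValueError("agreement must be a fraction between 0 and 1")
    if agreement >= accept_at and evidence_found:
        return "accept"
    if agreement >= accept_at:
        # Samples agree, but nothing supports the answer: get more context.
        return "gather_context"
    if agreement >= retry_at:
        return "retry"
    return "escalate"


print(route(agreement=0.9, evidence_found=True))
print(route(agreement=0.4, evidence_found=True))
```

Every branch changes what the system does next, which is the test of whether a confidence signal is earning its keep.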

---

# Ensembles

How model ensembles can add useful perspectives, and why they are not always the simplest path to consistent AI behavior.

Source: https://sutro.sh/handbook/patterns/consistency/ensembles

Updated: 2026-06-02

## Ensembles

Using model ensembles can be similar to [parallel sampling](/patterns/consistency/parallel-sampling), except that you strategically rely on each model's biases to cancel out, mitigating their overall effect. Generally speaking, teams turn to ensembles when a task is really critical and they're nervous that a single model will be too biased for that task, or when they want a hive-mind-like intelligence.

This can no doubt be powerful, but the effect is often less increased consistency so much as simply adding more voices in the room. It can be hard to productionize and maintain the behavior of an ensemble, especially if using [closed APIs](/deployment/model-selection/open-source-vs-closed) that are subject to deprecation, time of day quantization, and similar.

We're not saying it's a bad approach. But it's generally not a great way to increase overall consistency so much as surface other useful properties of your task. For deployment-level tradeoffs, see [routers and ensembles](/deployment/model-selection/routers-and-ensembles).

---

# Temperature

How to tune model temperature with evals instead of assuming zero temperature is always best for consistency.

Source: https://sutro.sh/handbook/patterns/consistency/temperature

Updated: 2026-06-02

## Temperature

We briefly mentioned this above, but setting model [temperature to zero](/patterns/consistency/determinism) is not an answer to both consistency AND task reliability. Setting a lower temperature than default may be strategic, but you'll likely want to run [evals](/patterns/evals) against an [annotated dataset](/patterns/context/expert-annotation) to find what the optimal setting is for the model and task combination.
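
A temperature sweep against an annotated dataset can be sketched as follows. The `predict` callable stands in for your own model call and is hypothetical, as are the function names; the stub used here exists only to make the example runnable.

```python
from typing import Callable, Dict, Sequence, Tuple


def sweep_temperature(
    predict: Callable[[str, float], str],
    dataset: Sequence[Tuple[str, str]],
    temperatures: Sequence[float],
) -> Dict[float, float]:
    """Measure accuracy on (input, expert_label) pairs at each temperature."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    results: Dict[float, float] = {}
    for temperature in temperatures:
        correct = sum(predict(text, temperature) == label for text, label in dataset)
        results[temperature] = correct / len(dataset)
    return results


def best_temperature(results: Dict[float, float]) -> float:
    """Return the temperature with the highest accuracy (first one wins ties)."""
    return max(results, key=results.get)


def stub_predict(text: str, temperature: float) -> str:
    # Stand-in for a real model call; replace with your inference client.
    return "pass" if temperature <= 0.5 else "fail"


dataset = [("case a", "pass"), ("case b", "pass"), ("case c", "fail")]
results = sweep_temperature(stub_predict, dataset, [0.0, 1.0])
print(results, best_temperature(results))
```

Because sampled outputs vary from run to run, a real sweep should repeat each setting several times and compare averages rather than single runs.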

---

# Context

What your model needs to know to get it right.

Source: https://sutro.sh/handbook/patterns/context

Updated: 2026-06-02

Like humans, models need the [right information](/patterns/context/expert-annotation/why-expert-annotations-matter) to get a job done. Also like humans, models given confusing, ambiguous, or absent instructions and/or evidence will struggle to get a task done well.

A significant part of the job of an AI engineer is to ensure a model has the context it needs, which can be gathered and provided in a variety of ways. We'll discuss a few of these options and the pragmatic tradeoffs between them.

---

# Expert Annotations

Model behavior should be grounded in expert-reviewed data, and abstracted into generalized rulesets.

Source: https://sutro.sh/handbook/patterns/context/expert-annotation

Updated: 2026-06-02

Expert annotations are the bridge between a model's generic capabilities and the last-mile judgment required for a specific task. They give teams a concrete way to inspect [model behavior](/patterns/consistency), capture expert corrections, and turn repeated review into better system behavior over time.

Use this section when you need to decide who should review model outputs, [what each annotation should contain](/patterns/context/expert-annotation/what-good-annotations-capture), and [which cases](/patterns/context/expert-annotation/which-cases-to-annotate) are worth spending expert time on.

---

# Why Expert Annotations Matter

Model behavior should be grounded in expert-reviewed data, not guessed at from aggregate benchmarks.

Source: https://sutro.sh/handbook/patterns/context/expert-annotation/why-expert-annotations-matter

Updated: 2026-06-02

## Just read the f***ing data

In working with customers deploying AI systems, one thing is often clear: **nearly all model behavior problems are actually inference-time data problems**.

> If you see a smart model consistently failing at a certain task it's typically not because the model is trained poorly, but rather you are supplying bad/missing instructions or data for the model to use as available [context](/patterns/context).

Therefore, the first reflex to improving task performance should just be tearing into a representative underlying cut of data the task is being run on.

Many teams will go overboard at the outset: buying observability products, automated monitoring tools, or integrating off-the-shelf [eval products](/patterns/evals/static-evals-vs-judges). But more reasonably, you should just find a way to get model inputs and outputs into an interface where you and/or a domain expert can review them. Manually reading over just a handful of results will often provide a massive diagnostic lift to start understanding where to use scaled approaches.

## Expert-in-the-loop (EITL)

Oftentimes, the developer of an AI product is not the domain expert of the task that the product aims to augment or automate. Before you start annotating, it's important to do an honest read of the situation - are you the one whose judgment the model should be using? If you are not the expert, designate one who is for best results.

## One, or multiple experts?

It may be helpful to have multiple experts in some cases, but it's likely simpler to produce a single expert annotation per case being reviewed. Otherwise you'll be forced to use some sort of post-processing logic for adjudicating disagreements between experts.

So even if multiple voices are in the room, it's best to consolidate their opinions into a single annotation.

## Are expert annotations just static evals?

**No.** Static evals are individual test cases, or expected output contracts with a model. They're the passive, defensive cousin of expert annotations.

Sutro believes in using **expert annotations as the primary learning signal from which model behavior should be derived**. In the literature, this technique is known as [reinforcement learning from human feedback (RLHF)](/patterns/consistency/fine-tuning-and-rl) and is one of the primary ways in which foundation model providers align behavior in the first place. But while the foundation model providers use RLHF to create general model behavior, your goal in collecting annotations is to create the last-mile subjective learnings a model needs to complete a task like an expert would.

---

# What Good Annotations Capture

Useful expert annotations preserve the model input, model output, expert correction, rationale, and metadata needed to improve behavior over time.

Source: https://sutro.sh/handbook/patterns/context/expert-annotation/what-good-annotations-capture

Updated: 2026-06-02

You shouldn't be collecting annotations for them to collect dust in old spreadsheets. If created and used well, these will become the unstructured gold needed to monotonically improve an analytical AI system over time.

A maximally useful expert annotation has the following schema:

| Field | Capture | Why it matters |
| --- | --- | --- |
| Input | The exact input the model received. | Lets you reproduce the case and understand what [context](/patterns/context) the model had. |
| Model output | The model's response, ideally with a terse rationale justifying its response. | Shows both the behavior and the apparent reasoning behind its behavior. |
| Expert correction | The expert-corrected output, if a correction is necessary. | Provides the target behavior the system should learn. |
| Expert rationale | Why the correction is right, especially when the rationale differs from the model's. | Turns a single example into a reasoning artifact that can be later abstracted into a decision rule. |
| Inference metadata | Model used, [system prompt](/patterns/consistency/task-specificity), sampling params, timestamp, and related runtime details. | Keeps the annotation tied to the exact system behavior being reviewed. |
| Expert metadata | Labeler identity, timestamp, and review context. | Supports auditability and disagreement review. |
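
As one possible representation, the schema above maps naturally onto a small record type. The field names below are our own shorthand for the table rows, not a Sutro data format, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpertAnnotation:
    model_input: str          # the exact input the model received
    model_output: str         # the model's response
    model_rationale: str      # the model's terse justification
    inference_metadata: dict  # model, system prompt, sampling params, timestamp
    expert_metadata: dict     # labeler identity, timestamp, review context
    expert_correction: Optional[str] = None  # None when no correction is needed
    expert_rationale: Optional[str] = None   # why the correction is right

    def target_output(self) -> str:
        """The behavior the system should learn: the correction if one exists."""
        if self.expert_correction is not None:
            return self.expert_correction
        return self.model_output


annotation = ExpertAnnotation(
    model_input="Can you help me return order ABC12345?",
    model_output="pass",
    model_rationale="The agent asked for delivery confirmation.",
    inference_metadata={"model": "example-model", "temperature": 0.7},
    expert_metadata={"labeler": "expert-1"},
    expert_correction="fail",
    expert_rationale="The agent never asked for the delivery address.",
)
print(annotation.target_output())
```

Keeping the correction optional lets the same record type store both corrected cases and cases where the expert confirmed the model's output.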

That's not so scary! 

But it is a lot of work to keep these records clean, versioned, and accessible. Sutro helps with this by acting as an [annotation store](/patterns/context/expert-annotation) that can be used directly to modify model/agent behavior.

---

# Which Cases to Annotate

Annotation quality depends on choosing cases that expose ambiguity, edge behavior, and the expert judgment the model needs to learn.

Source: https://sutro.sh/handbook/patterns/context/expert-annotation/which-cases-to-annotate

Updated: 2026-06-02

{{sutro-callout title="Sutro finds high-signal candidates for annotation automatically." body="Use Sutro to minimize time spent on label curation, so you can get back to building." cta="See how Sutro works" href="https://sutro.sh/"}}

## How many annotations are needed?

This is highly task dependent, but the answer is often fewer than you might think. We've seen as few as 10 annotations make extremely meaningful impacts on [task accuracy](/patterns/evals). Generally speaking, collecting something like 30-50 annotations is sufficient to capture a representative sample for a model to "learn" to perform the task like an expert, especially when used with an automated prompt optimization framework.

## How do I find *good* cases for annotation?

This is an age-old question in machine learning, and *significantly* impacts overall time spent on annotation as well as accuracy/task learnability. We said we wouldn't advertise much in this handbook, but this is one of the core value offerings of Sutro. We automated the process of [edge-case discovery](/patterns/evals/where-to-start) so you can minimize time spent on annotation while maximizing improvements in accuracy.

---

# Evals

Patterns for measuring AI system behavior, reliability, and quality before and after release.

Source: https://sutro.sh/handbook/patterns/evals

Updated: 2026-05-19

Evals are the measurement layer for AI systems. They help teams understand whether a candidate model, agent, prompt, workflow, or retrieval system behaves well enough for the job it is meant to do.

Unlike conventional software tests, evals often need to measure behavior over unstructured inputs and probabilistic outputs. That creates an important distinction: **you should never seek or expect 100% test coverage with evals, or you are likely overfitting to a narrow set of cases.**

Rather, your goal should be to seek performance at or above human-level capability, as measured by [expert-grounded judges](/primitives/judges).

## Pages in This Section

- [Evals as Outer Loop](/patterns/evals/outer-loop): how evals fit into AI development and post-deployment monitoring.
- [Eval Patterns](/patterns/evals/approaches): common measurement approaches and how they fit together.
- [Where to Start](/patterns/evals/where-to-start): how to choose the first eval that can change a decision.
- [Static Evals vs. Judges](/patterns/evals/static-evals-vs-judges): why static evals and LLM judges solve different parts of the AI measurement problem.

---

# Evals as Outer Loop

How evals fit into AI development and post-deployment monitoring.

Source: https://sutro.sh/handbook/patterns/evals/outer-loop

Updated: 2026-05-19

Test-driven development has been a long-standing pattern to create reliable software. The premise is simple: first write tests that guarantee system reliability, then build the software that passes the test suite. The process of building the tests around the system can be thought of as *the outer loop*, and the architecture and implementation of the system itself can be thought of as the inner loop.

## Eval-driven development?

AI engineering is not that different; the only caveat is that you're not seeking guarantees. That is the tradeoff we make when reaching for [non-deterministic systems](/patterns/consistency/determinism). So while the idea of eval-driven development has been proposed, we do not wholeheartedly endorse it quite yet.

Why? Because foundation models contain *some* degree of reliability to start with, perhaps even high enough to ship in a low-stakes setting. Thus, evals should serve the purpose of filling in the remaining gaps, not necessarily defining initial behavior. That said, the process of creating evals to identify model failures is extremely important, so incorporating evals early into the development process is highly encouraged.

## Why should I care about evals?

In our opinion, evals serve two purposes:

1. Create a rubric and measurement system around which you can improve an AI system.
2. Use that measurement system to improve the AI system during initial development and after production deployment.

There's an implicit, uncomfortable truth buried in there. Can you spot it?

That truth: **nearly all AI systems have no real [ground-truth](/patterns/context/expert-annotation)**.

Evals stand in for that missing ground truth, so they should answer questions like:

- Is the system reliably doing the task it was built to do?
- What are the common failure modes within my control?
- Are changes to the AI system's behavior, such as [swapping models](/deployment/model-selection), changing prompts, or introducing or removing context, improving or hurting performance?
- Is the system reliable enough for the workflow, user, and risk level it serves?

---

# Eval Approaches

Common AI eval approaches and the role each one plays in measurement.

Source: https://sutro.sh/handbook/patterns/evals/approaches

Updated: 2026-05-19

Most eval systems combine multiple approaches:

## Static Evals

- **Static eval sets:** curated examples that are run repeatedly to catch regressions and compare candidate changes.
- **[LLM judges](/primitives/judges):** model-based evaluators that map unstructured outputs into bounded labels, rationales, or classifications.
- **[Human annotation](/patterns/context/expert-annotation):** expert review used to ground judge behavior, audit model performance, and build trust in measurements.
- **Production sampling:** real traces or records sampled from live usage to discover new failure modes and measure field behavior.
- **[Operational metrics](/deployment):** latency, cost, refusal rate, escalation rate, completion rate, and other system-level signals.

No single eval method is sufficient on its own. Static evals provide repeatability, judges provide scale over unbounded behavior, and human review provides grounding.
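
As a rough illustration of the repeatability that static eval sets provide, the sketch below runs two versions of a system over the same curated examples and flags a regression. Everything here is hypothetical: the `EVAL_SET` examples, the labels, and the keyword-based `baseline` and `candidate` functions are stand-ins for real model calls.

```python
from typing import Callable, List, Tuple

# Hypothetical curated eval set: (input text, expected label) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("Refund not received after 30 days", "billing"),
    ("App crashes when I open settings", "bug"),
    ("How do I export my data?", "how_to"),
    ("Charged twice this month", "billing"),
]


def accuracy(predict: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Fraction of curated examples where the prediction matches the expected label."""
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def baseline(text: str) -> str:
    """Stand-in for the current system; in practice this would call a model."""
    lowered = text.lower()
    if "refund" in lowered or "charged" in lowered:
        return "billing"
    return "bug" if "crash" in lowered else "how_to"


def candidate(text: str) -> str:
    """Stand-in for a changed prompt that no longer recognizes double charges."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    return "bug" if "crash" in lowered else "how_to"


def is_regression(old: float, new: float, tolerance: float = 0.0) -> bool:
    """True when the new score falls below the old score by more than the tolerance."""
    return new < old - tolerance


baseline_score = accuracy(baseline)
candidate_score = accuracy(candidate)
print(baseline_score, candidate_score, is_regression(baseline_score, candidate_score))
```

Because the examples are fixed, the same comparison can be re-run on every model, prompt, or context change.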

---

# Where to Start

How to choose a first eval that is narrow enough to build and useful enough to matter.

Source: https://sutro.sh/handbook/patterns/evals/where-to-start

Updated: 2026-05-19

Start with the smallest eval that can change a decision. For most teams, that means defining a narrow [reliability question](/primitives/judges/types) and building a repeatable way to measure it.

Examples:

- Did the agent complete the user's requested task?
- Did the response contain unsupported claims?
- Did the system follow the required escalation policy?
- Did the [extraction output](/primitives/extractors) include the required fields?
- Did the workflow fail due to [missing context](/patterns/context), bad tool use, or model reasoning?

Once the first measurement is useful, expand coverage by adding more task-specific checks, judge-backed labels, and production samples.
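
A first eval can be very small. The sketch below, offered as one possible starting point, answers the extraction question above: did each output include the required fields? The field names in `REQUIRED_FIELDS` and the sample outputs are made up for illustration.

```python
import json
from typing import List

# Hypothetical schema for an invoice extraction task.
REQUIRED_FIELDS = {"invoice_id", "total", "currency"}


def has_required_fields(raw_output: str, required=REQUIRED_FIELDS) -> bool:
    """True if the output is a JSON object with a non-null value for every required field."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(name) is not None for name in required)


def pass_rate(outputs: List[str]) -> float:
    """Share of outputs that pass the required-fields check."""
    if not outputs:
        return 0.0
    return sum(has_required_fields(o) for o in outputs) / len(outputs)


sample_outputs = [
    '{"invoice_id": "A-1", "total": 12.5, "currency": "USD"}',
    '{"invoice_id": "A-2", "total": null, "currency": "USD"}',  # missing value
    "not json",                                                  # unparseable
    '{"invoice_id": "A-3", "total": 8, "currency": "EUR"}',
]
print(pass_rate(sample_outputs))
```

A single number like this pass rate is enough to compare two prompts or models, which is what makes it a useful first measurement.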

---

# Static Evals vs. Judges

Why LLM judges make sense alongside static evals when evaluating AI systems over unbounded input spaces.

Source: https://sutro.sh/handbook/patterns/evals/static-evals-vs-judges

Updated: 2026-05-19

Consider a simple fact: the input to an LLM is unstructured data, such as text, images, audio, and more, in an infinite, unbounded range. Even if you could [force determinism](/patterns/consistency/determinism), such that the exact same set of input characters always yielded the same result from a model, you would still have an infinite set of input cases to test against.

A static set of test cases can only cover a small, discrete sample from this range, which is why most static benchmarks are easy to overfit and are often considered unreliable. With an [LLM judge](/primitives/judges), you can cover a wide, continuous range of possible inputs. But whatever method you choose, you are sampling from an infinite set of test cases, so it is time to throw out the idea that you will ever have perfect test coverage when building with AI.

Much of the rest of the [judge-design guide](/primitives/judges) is about effectively using an AI judge as a bridge from an unbounded output space to a bounded range that can be validated.
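
To make the "bridge" idea concrete, here is a minimal sketch that forces free-form judge output into a bounded label set and then checks those labels against expert annotations. The label names, the `stub_judge` function (a keyword check standing in for a real model call), and the example data are all assumptions for illustration.

```python
from typing import Callable, List

# Hypothetical bounded label set for an "unsupported claims" judge.
ALLOWED_LABELS = {"supported", "unsupported"}


def bounded_judge(raw_judge: Callable[[str], str], response: str) -> str:
    """Map free-form judge output onto the bounded label set; anything else is 'invalid'."""
    label = raw_judge(response).strip().lower()
    return label if label in ALLOWED_LABELS else "invalid"


def agreement_rate(judge_labels: List[str], expert_labels: List[str]) -> float:
    """Fraction of items where the judge label equals the expert label."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(expert_labels)


def stub_judge(response: str) -> str:
    """Stand-in for an LLM judge call; returns deliberately messy casing and spacing."""
    if "according to the invoice" in response.lower():
        return "Supported "
    return "UNSUPPORTED"


responses = [
    "According to the invoice, the total is $40.",
    "The customer is probably upset.",
    "According to the invoice, payment is due in May.",
]
expert_labels = ["supported", "unsupported", "unsupported"]  # hypothetical annotations

judge_labels = [bounded_judge(stub_judge, r) for r in responses]
print(judge_labels, agreement_rate(judge_labels, expert_labels))
```

The judge can see any input, but its output is always one of a few labels, so it can be validated the same way a classifier would be.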

---

# Deployment

Choices to make when your models are ready for action.

Source: https://sutro.sh/handbook/deployment

Updated: 2026-06-02

Deployment pages cover the runtime choices that shape AI system cost, latency, reliability, and operational control.

## Pages in This Section

- [Batch vs. Real-Time Inference](/deployment/batch-vs-real-time-inference): when to run analytical AI workloads as batch jobs instead of real-time APIs.
- [Model Selection](/deployment/model-selection): how to choose a model based on task fit, cost, latency, control, and operational constraints.

---

# Batch vs. Real-Time Inference

Faster, cheaper, better

Source: https://sutro.sh/handbook/deployment/batch-vs-real-time-inference

Updated: 2026-06-02

Most [analytical AI](/) workloads can be run as batch inference jobs. Like its historical data-processing analogs, batch inference has a number of appealing properties: faster job completion times (via higher throughput on large sets of inputs), lower costs, and less need for constant resource allocation.

{{sutro-callout title="Use Sutro for Batch Inference" body="Sutro provides a simple, usage-based batch inference service. 1-hour job SLAs, >80% cost reductions, a web app for job observability, and a simple Python SDK" cta="Try Sutro batch today" href="https://sutro.sh/"}}

So why don't all teams rely on batch inference for non user-facing workloads?

We generally find that it comes down to a few reasons:
1) Real-time inference APIs have been the norm since model companies began offering foundation models as a service. Many teams simply architected their systems around these APIs, and switching costs haven't seemed worthwhile.
2) Many systems are high-volume but not batch-oriented: requests arrive steadily over time rather than all at once. It's possible to process these requests somewhat asynchronously or queue them to run as a batch, but they're realistically best handled as event-driven workloads.
3) Many teams simply aren't at a scale where batch inference, even if saving >50% on inference costs, would help them materially.
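
For the second reason above, one common way to queue event-driven requests into batches is a micro-batcher that flushes on size or age. The sketch below is a generic illustration, not any provider's API; the class name, parameters, and thresholds are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class MicroBatcher:
    """Collects items and releases them as a batch when full or when the oldest item is stale."""

    max_size: int        # flush once this many items are queued
    max_wait_s: float    # flush once the oldest queued item has waited this long
    _items: List[Any] = field(default_factory=list)
    _first_arrival: Optional[float] = None

    def add(self, item: Any, now: float) -> Optional[List[Any]]:
        """Queue an item; return a batch to submit if a flush condition is met, else None."""
        if not self._items:
            self._first_arrival = now
        self._items.append(item)
        return self.flush_if_due(now)

    def flush_if_due(self, now: float) -> Optional[List[Any]]:
        """Return and clear the queued items if the batch is full or stale."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_size
        stale = now - self._first_arrival >= self.max_wait_s
        if full or stale:
            batch, self._items = self._items, []
            self._first_arrival = None
            return batch
        return None


batcher = MicroBatcher(max_size=3, max_wait_s=60.0)
print(batcher.add("a", now=0.0), batcher.add("b", now=1.0), batcher.add("c", now=2.0))
```

The tradeoff is visible in the two parameters: a larger `max_size` or `max_wait_s` improves batching efficiency but adds latency to every queued request.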

## When batch is better

| Workload characteristic | Batch inference is usually better when... | Real-time inference is usually better when... |
| --- | --- | --- |
| **Latency tolerance** | Results can arrive minutes or hours later without hurting the product or workflow. | The user, system, or transaction is waiting for the result immediately. |
| **Input shape** | Inputs arrive in large groups, scheduled exports, backfills, or periodic refreshes. | Inputs arrive one at a time or continuously as user actions and events happen. |
| **Volume vs. Cost** | The workload is large enough that provider batch discounts, higher utilization, or self-hosted batching materially reduce cost. | Volume is low enough that batching savings do not justify added system complexity. |
| **Failure handling** | Failed items can be retried, inspected, or replayed asynchronously. | Failures need immediate fallback behavior because they block a live experience. |
| **Task orientation** | The job is a data-processing pipeline: enrich records, [classify documents](/primitives/classifiers), [extract fields](/primitives/extractors), score accounts, or reprocess historical data. | The model is part of an interactive product loop, complex agentic process, routing decision, or user-facing assistant. |
| **Capacity planning** | You can schedule work around available compute, rate limits, or cheaper off-peak capacity. | Capacity must be available whenever requests arrive. |
| **Versioning and replay** | You want to re-run the same task over a fixed dataset with a specific [model](/deployment/model-selection), [prompt](/patterns/consistency/task-specificity), and schema version. | The freshest context matters more than reproducible replay over a bounded dataset. |

A simple rule of thumb: use batch inference when the work looks like a data pipeline; use real-time when the model sits inside an interactive user-facing application.
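
The "Versioning and replay" row can be made concrete with a run fingerprint: a deterministic ID derived from the model, prompt version, schema version, and dataset, so that a replay over the same inputs is recognizable as the same job. This is only a sketch; the function name, the manifest fields, and the model and version strings are hypothetical.

```python
import hashlib
import json
from typing import List


def job_fingerprint(
    model: str, prompt_version: str, schema_version: str, record_ids: List[str]
) -> str:
    """Deterministic ID for a batch run: the same model, prompt, schema, and records give the same ID."""
    manifest = {
        "model": model,
        "prompt_version": prompt_version,
        "schema_version": schema_version,
        "record_ids": sorted(record_ids),  # order of records should not change the ID
    }
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


run_a = job_fingerprint("example-model-v1", "prompt-3", "schema-2", ["r1", "r2", "r3"])
run_b = job_fingerprint("example-model-v1", "prompt-3", "schema-2", ["r3", "r1", "r2"])
run_c = job_fingerprint("example-model-v1", "prompt-4", "schema-2", ["r1", "r2", "r3"])
print(run_a == run_b, run_a == run_c)
```

Storing results under a key like this makes it straightforward to compare two prompt or model versions over the same bounded dataset.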

## Batch limitations

Batch inference comes with a handful of system tradeoffs to weigh:
1. Batch inference is usually best for single-turn LLM calls where all of the context is known upfront. An unknown number of tool calls, sandboxed code executions, or other steps that interrupt inference can defeat some of the benefits of high-throughput inference or make them harder to architect around.
2. Many providers offering batch inference services have 24-hour or, worse, 72-hour SLAs and flaky success guarantees. This is hard to design around, makes prototyping harder, and nearly rules out batch inference services for experimental work.
3. Most batch inference services are second-class products, with little observability, rigid handling of data, and inconsistent response times.

## Flex/Async Processing

Although not yet widely used, some inference providers have started to offer a middle ground: flexible or asynchronous processing modes. These are one-off calls (rather than batches) that run in the background with longer processing times, or under a longer SLA, in exchange for lower costs. If your application is event-driven but can tolerate higher latency, this may be a good alternative to batch.

---

# Model Selection

Selecting the right model for the task at hand.

Source: https://sutro.sh/handbook/deployment/model-selection

Updated: 2026-06-02

## Choose a not-bad model

This is going to be one of the shorter sections of this handbook, because at Sutro we believe model selection is one of the least important decisions an AI engineer will make in building a reliable AI system for the vast majority of [analytical tasks](/), and this calculus will become even easier over time.

## Foundation Models as Operating Systems

Without logging into AWS, can you tell me the exact Linux distribution you used in the most recent VM you launched? Probably not, but you can probably tell me which operating system your laptop runs (macOS, Windows, or Linux).

That's because operating systems became a commodity as they improved, and their core functionality is largely identical for developers. For instance, running a webserver or database on a virtual machine today is likely more of a question of application needs and vertical/horizontal scaling than whether the application can run at all.

Similarly, we believe that as instruction-following and data-understanding improve within foundation models, they'll be able to handle virtually any "learnable" unstructured data processing/mapping task. More of an emphasis will be placed on things like application [latency/throughput, scaling, and costs](/deployment/batch-vs-real-time-inference) than choice of base model.

{{sutro-callout title="The end of vibes-based AI engineering?" body="Sutro provides direct measurement infrastructure for a variety of analytical models to end guess-work around task-specific AI." cta="Learn how Sutro measures task accuracy" href="https://sutro.sh/"}}

## Pages in This Section

- [Open-Source vs. Closed Models](/deployment/model-selection/open-source-vs-closed): why analytical AI systems should usually move toward open-source model infrastructure over time.
- [Performance Tradeoffs](/deployment/model-selection/performance-tradeoffs): how to balance model intelligence, latency, throughput, cost, and task reliability.
- [Routers and Ensembles](/deployment/model-selection/routers-and-ensembles): when escalation, voting, or multiple-model approaches are worth the added system complexity.

The practical default is simple: choose one strong enough model, [measure it against real task behavior](/patterns/evals), and only add model-level complexity when the eval loop proves that a simpler setup is insufficient.

---

# Open-Source vs. Closed Models

How to think about provider choice, ecosystem control, and model ownership for analytical AI systems.

Source: https://sutro.sh/handbook/deployment/model-selection/open-source-vs-closed

Updated: 2026-06-02

## Why open-source models are a better choice for analytical AI

Over time, it is becoming increasingly sensible for developers to build AI products on top of open-source foundation models, both for more control and flexibility and to reduce the costs and risks associated with closed ecosystems.

Closed providers may continue to be the best option for consumers, and many "AI apps" may be offered natively via MCP servers through closed models.

But for the audience of this handbook, who will generally be using AI inference as a core [data processing primitive](/), we recommend switching to an open-source model ecosystem sooner rather than later.

## What the choice changes

Provider choice matters less because one model is permanently better than another, and more because it shapes the operating constraints around your system.

Closed model APIs can be useful when you need quick access to frontier capabilities, low infrastructure overhead, or a simple managed endpoint. They are often the fastest way to test whether a task can be solved at all.

Open-source models become more attractive once the [workload is repeatable](/deployment/batch-vs-real-time-inference). At that point, the system benefits from more control over inference, deployment topology, cost structure, data handling, versioning, and tuning.

## The practical default

Start with whatever model lets you build the task and measurement loop quickly. Once the task becomes a production workload, re-evaluate whether closed-provider convenience is still worth the tradeoff.

For analytical AI systems, the long-term direction should usually be toward owning more of the model runtime.

---

# Performance Tradeoffs

How to balance intelligence, latency, throughput, cost, and reliability when choosing a model.

Source: https://sutro.sh/handbook/deployment/model-selection/performance-tradeoffs

Updated: 2026-06-02

## Balance intelligence and performance for the task at hand

There is no one-size-fits-all guide today for which model to choose, so we recommend basing the decision on a number of factors.

If you're building an analytical AI system, it typically implies a model that will run the [same task, many times](/). You should be able to optimize that model's performance against [strong evals](/patterns/evals) you've built to validate its overall sufficiency.

As of this writing, we recommend using models with at least ~30B total parameters and internal reasoning capabilities, unless cost or latency needs rule out this size. Below this size, we've seen noticeable lapses on out-of-distribution tasks, or weaker inference efficiency (due to longer reasoning traces), either of which can defeat most cost or latency gains. Above this size, there are often diminishing returns on quality for well-defined tasks.

## What to optimize

Model choice should be evaluated against the production shape of the workload:

- Accuracy on representative task cases
- Throughput and latency requirements
- Cost per successful task completion
- Operational control over batching, scaling, and retries
- Stability across task variants and out-of-distribution inputs

The mistake is optimizing only one of these dimensions in isolation. A cheaper model that needs longer traces, more retries, or more [human review](/patterns/context/expert-annotation) may not actually be cheaper. A larger model that improves quality by a marginal amount may not be worth the latency or serving cost.
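
A quick worked example shows how a cheaper model can end up costing more per successful task. The formula below is one simple way to account for retries and human review, and every number is invented for illustration rather than measured.

```python
def cost_per_success(
    cost_per_call: float,
    calls_per_task: float,
    review_rate: float,
    review_cost: float,
    success_rate: float,
) -> float:
    """Average spend per task (model calls plus human review) divided by the success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    spend_per_task = cost_per_call * calls_per_task + review_rate * review_cost
    return spend_per_task / success_rate


# Hypothetical small model: cheap calls, but more retries and more human review.
small = cost_per_success(
    cost_per_call=0.001, calls_per_task=1.5,
    review_rate=0.10, review_cost=0.50, success_rate=0.90,
)

# Hypothetical larger model: 4x the call price, fewer retries and reviews.
large = cost_per_success(
    cost_per_call=0.004, calls_per_task=1.0,
    review_rate=0.02, review_cost=0.50, success_rate=0.98,
)

print(f"small: ${small:.4f} per success, large: ${large:.4f} per success")
```

With these made-up inputs the larger model is the cheaper option per successful task, because human review dominates the cost. Different inputs can just as easily flip the result, which is the reason to compute it rather than assume.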

## Use evals to choose

Model selection should be made against [task-specific evals](/patterns/evals) rather than general benchmark impressions. The right question is not "which model is best?" It is "which model is sufficient for this workload under the constraints we actually have?"
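
The "sufficient, not best" question can be expressed as a filter followed by a minimum: keep the candidates that meet the accuracy and latency constraints, then take the cheapest. The sketch below assumes you already have eval results per model; the model names and all figures are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_accuracy: float      # accuracy on your task-specific eval set
    p95_latency_s: float      # measured under production-like load
    cost_per_1k_tasks: float  # in dollars


def cheapest_sufficient(
    candidates: List[Candidate], min_accuracy: float, max_latency_s: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets both constraints, or None if none do."""
    sufficient = [
        c for c in candidates
        if c.eval_accuracy >= min_accuracy and c.p95_latency_s <= max_latency_s
    ]
    return min(sufficient, key=lambda c: c.cost_per_1k_tasks, default=None)


# Placeholder eval results for three hypothetical models.
results = [
    Candidate("model-small", eval_accuracy=0.89, p95_latency_s=0.8, cost_per_1k_tasks=0.40),
    Candidate("model-medium", eval_accuracy=0.95, p95_latency_s=1.5, cost_per_1k_tasks=1.20),
    Candidate("model-large", eval_accuracy=0.97, p95_latency_s=4.0, cost_per_1k_tasks=6.00),
]

choice = cheapest_sufficient(results, min_accuracy=0.93, max_latency_s=5.0)
print(choice.name if choice else "no candidate is sufficient")
```

Returning `None` when nothing qualifies is deliberate: it signals that the constraints or the task definition need revisiting, rather than silently picking the least bad option.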

---

# Routers and Ensembles

When routing, escalation, majority voting, or multiple-model approaches are worth the complexity.

Source: https://sutro.sh/handbook/deployment/model-selection/routers-and-ensembles

Updated: 2026-06-02

## Routers and ensembles

Some teams will take a [routing-based approach](/patterns/consistency/confidence-scores), starting with a small model, and escalating to larger and more powerful models if needed.

Others will use [ensembles](/patterns/consistency/ensembles): several different models perform the same task, and some kind of majority-voting logic decides on the final answer.

While these approaches can be viable, we recommend choosing [one strong model](/deployment/model-selection/performance-tradeoffs) for the task at hand. In some cases the right answer here may be a mixture-of-experts (MoE) type model, which already uses router/ensemble logic as part of its architecture.
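
For readers weighing the added complexity, the two approaches reduce to a few lines of control logic. The sketch below shows confidence-based escalation and majority voting with stub models; the `small_model` and `large_model` functions, the 0.8 threshold, and the labels are assumptions, and real confidence scores would need their own calibration.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A model here is any callable returning (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(task: str, small: Model, large: Model, threshold: float = 0.8) -> str:
    """Use the small model's label when it is confident enough; otherwise escalate."""
    label, confidence = small(task)
    if confidence >= threshold:
        return label
    return large(task)[0]


def majority_vote(labels: List[str]) -> str:
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def small_model(task: str) -> Tuple[str, float]:
    """Stub: confident only on an obvious phrase."""
    return ("spam", 0.95) if "free money" in task else ("ham", 0.55)


def large_model(task: str) -> Tuple[str, float]:
    """Stub: also catches a subtler phrase."""
    return ("spam", 0.90) if "wire transfer" in task else ("ham", 0.90)


print(route("free money now", small_model, large_model))
print(route("urgent wire transfer needed", small_model, large_model))
print(majority_vote(["spam", "ham", "spam"]))
```

Even in this toy form, each approach adds a threshold or voting rule that itself has to be evaluated, which is part of why a single strong model is the simpler default.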

---

# Architectures

Coming soon

Source: https://sutro.sh/handbook/architectures

Updated: 2026-05-18