Speed up your eval loop. Build a judge you can trust.

Build the most accurate, trustworthy judges, classifiers, and extractors in hours, not weeks.

Zero prompt engineering, fine-tuning, or upfront data labeling required.

Pass

Total Count: 5,287

Fail

Total Count: 1,319

PASS

FAIL

45% CONFIDENCE

Apply the student discount.

Do you have a .edu email or verification link?

Sutro Functions

A new way to quickly build expert-aligned judges, classifiers, and extractors.

Support Agent Judge v1.3

Pass/fail judge for our new customer support agent.

Avg. Confidence: 95% (↑6%)

User / Model Agreement: 75% (↑15%)

Cost/1,000 traces: $0.03 (↓75%)

Sutro Batch

Run functions at any scale. Offline evals, unstructured data transformation, and more.

Inputs

Outputs

Need help with plan upgrade

New computer login

Contract renewal question

Server updates?

Billing

IT

Sales

Sales

+53,192 more…

Say goodbye to slow, brittle prompt engineering and massive, costly labeling queues

Sutro auto-labels your data and surfaces only the ambiguous cases for last-mile preference learning. Labeling is a breeze: as easy as a left or right swipe.

|Add system prompt…

Cost: $5

Time: 10m

And hello to accurate, consistent, and trustworthy decision-making

Functions know when they don't know, returning calibrated, numerical confidence scores for reliable gating and escalation workflows.
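To make the gating idea concrete, here is a minimal sketch of a confidence gate in plain Python. It does not use any Sutro API: the `Decision` record, the `gate` helper, and the 0.8 threshold are all hypothetical names and values chosen for illustration, and the sketch assumes each decision already carries a calibrated confidence between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """One judged item (hypothetical record shape, not a Sutro type)."""
    item_id: str
    label: str          # e.g. "PASS" or "FAIL"
    confidence: float   # assumed calibrated, between 0.0 and 1.0


def gate(decisions, threshold=0.8):
    """Split decisions into auto-accepted and escalated lists by confidence."""
    accepted, escalated = [], []
    for d in decisions:
        if d.confidence >= threshold:
            accepted.append(d)
        else:
            escalated.append(d)
    return accepted, escalated


# Illustrative data only.
decisions = [
    Decision("trace-1", "PASS", 0.92),
    Decision("trace-2", "FAIL", 0.45),
    Decision("trace-3", "FAIL", 0.33),
]
accepted, escalated = gate(decisions, threshold=0.8)
print([d.item_id for d in accepted])   # ['trace-1']
print([d.item_id for d in escalated])  # ['trace-2', 'trace-3']
```

The right threshold depends on how costly a wrong automated decision is compared with a human review, so in practice it would be tuned per workflow.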

Cost: $0

Time: 0m

Functions are life-long learners

Once deployed to production, learning doesn’t end. Use confidence scores to surface new edge cases, data drift, or regressions and send them to a queue for continual learning.
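One way to picture the review queue is as the k least-confident production decisions. The sketch below is an assumption about how such a queue could be built, not a description of Sutro's implementation; the record shape, the `lowest_confidence` helper, and the sample values are hypothetical.

```python
import heapq


def lowest_confidence(records, k=2):
    """Return the k records with the smallest confidence, least confident first."""
    return heapq.nsmallest(k, records, key=lambda r: r["confidence"])


# Illustrative production decisions (hypothetical data).
records = [
    {"summary": "Agent misidentifies customer issue.", "confidence": 0.33},
    {"summary": "Refund attempted, transaction not found.", "confidence": 0.21},
    {"summary": "Chargeback amount identified correctly.", "confidence": 0.67},
    {"summary": "Helpful shipping instructions.", "confidence": 0.92},
]

review_queue = lowest_confidence(records, k=2)
for r in review_queue:
    print(f'{r["confidence"]:.2f}  {r["summary"]}')
```

Labels collected from this queue would then feed the next round of learning; how that update happens is outside the scope of this sketch.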

Update Model

Encode Decision Preferences

Uncover Low-confidence Examples

How It Works

Bring unlabeled data and
a simple task definition.

No ground-truth or golden set is needed.

|Add task definition…

Upload rows

We automatically label your data
and surface the most ambiguous cases.

We use an ensemble of frontier models to label your data, and surface cases where they disagree.

Choose the best decision and rationale or write your own.
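As a rough illustration of surfacing disagreement, the sketch below takes a majority vote over several labelers and flags any row where they are not unanimous. How Sutro actually measures disagreement is not stated here, so majority voting, the `ensemble_label` and `ambiguous_rows` helpers, and the sample votes are all assumptions.

```python
from collections import Counter


def ensemble_label(votes):
    """Return (majority label, fraction of voters that agree with it)."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


def ambiguous_rows(rows, min_agreement=1.0):
    """Flag rows whose agreement falls below min_agreement (default: unanimity)."""
    flagged = []
    for row_id, votes in rows.items():
        label, agreement = ensemble_label(votes)
        if agreement < min_agreement:
            flagged.append((row_id, label, agreement))
    return flagged


# Hypothetical votes from three labeling models.
rows = {
    "row-1": ["PASS", "PASS", "PASS"],
    "row-2": ["PASS", "FAIL", "PASS"],
    "row-3": ["FAIL", "FAIL", "FAIL"],
}
flagged = ambiguous_rows(rows)
print(flagged)  # only row-2 is flagged, with majority label PASS
```

Only the flagged rows would need a human decision; unanimous rows keep their automatic label.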

PASS

FAIL

33% CONFIDENCE

Help me reset my password

I'm not sure I can comply with this.

We compile your decision
preferences and learn your
generalizable rules.

Automatic prompt optimization, oh my.

Unlabeled Data

Once your task is learned,
we produce an expert model
ready for usage at scale.

And with continual learning,

it only gets better from here.

Additional Learning…

Agent misidentifies customer issue, yet proceeds regardless.

33% CONFIDENCE

Agent attempts to help refund user, but transaction is not found.

21% CONFIDENCE

Customer asks about chargeback amount, agent correctly identifies transaction and amount

67% CONFIDENCE

Agent responds with helpful clarifying instructions on shipping details.

92% CONFIDENCE

The building blocks for confident, high-volume AI

Sutro lets you confidently scale decisions you know you can trust.

LLM-as-a-judge

Build and run high-quality automated evals for AI products or agents. Dramatically speed up your eval workflow.


Great for:

LLM output evaluation

Pass/fail agent traces

QA gates

Classify

Organize unstructured data into one or several pre-defined categories, with confidence scores you can actually trust.

Great for:

Routers

Triaging systems

Semantic filters

Extract

Pull structured spans, keywords, and relevant passages into normalized schemas.


Great for:

Structuring large datasets for analytics

Document retrieval systems

Normalization scripts

Sutro Batch

Align, then scale. Serverless async inference with simple usage-based pricing tied to data volume.

Run Sutro Functions, custom models, and pre-trained LLMs over large datasets with thousands or millions of inputs.

10x

Faster

5x

Less Expensive

Simple Python SDK compatible with most data tools and dataframe libraries.
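For a feel of what batch-style processing looks like in Python, here is a generic chunking sketch. It deliberately does not use the Sutro SDK, whose interface is not shown here: `chunked` and `toy_classifier` are hypothetical stand-ins, and the keyword rule inside `toy_classifier` exists only so the example runs on its own.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def toy_classifier(texts):
    """Hypothetical stand-in for a deployed function; NOT a Sutro API."""
    return ["Billing" if "plan" in t.lower() else "Other" for t in texts]


inputs = [
    "Need help with plan upgrade",
    "New computer login",
    "Contract renewal question",
    "Server updates?",
]

outputs = []
for batch in chunked(inputs, size=2):
    outputs.extend(toy_classifier(batch))

print(list(zip(inputs, outputs)))
```

The same loop shape works over rows of a dataframe column or lines streamed from a file, since `chunked` accepts any iterable.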

FAQ

How is a Sutro Function's labeling process better than prompt engineering?

How is a Sutro Function different than prompting a foundation model like GPT, Claude, or Gemini?

How can I assess a Function’s decision confidence?

How can I improve my Function over time?

Can I use Sutro Functions on multimodal data (images, videos, etc.)?

How does Sutro Functions compare to DSPy and other prompt optimization tools?

Can I use Sutro products in my VPC?

Why should I use Sutro Functions and Batch together?

What Will You Scale with Sutro?