Break up with
prompt engineering.

Build the most accurate, trustworthy judges, classifiers, and extractors in hours, not weeks.

A new way to build AI that works with, not against you.

Pass

Total Count: 5,287

Fail

Total Count: 1,319

PASS

FAIL

45% CONFIDENCE

Apply the student discount.

Do you have a .edu email or verification link?

Sutro Functions

A new way to quickly build expert-aligned judges, classifiers, and extractors.

Support Agent Judge v1.3

Pass/fail judge for our new customer support agent.

Avg. Confidence: 95% (↑6%)

User / Model Agreement: 75% (↑15%)

Cost/1,000 traces: $0.03 (↓75%)

Sutro Batch

Run functions at any scale. Offline evals, unstructured data transformation, and more.

Inputs → Outputs

Need help with plan upgrade → Billing

New computer login → IT

Contract renewal question → Sales

Server updates? → Sales

+53,192 more…

Say goodbye to slow, brittle prompt engineering and massive, costly labeling queues

Stop wasting time crafting unstable prompts and hand-building golden sets, only to stay stuck in eval hell.

|Add system prompt…

Cost: $5

Time: 10m

And hello to accurate, consistent, and trustworthy decision-making

Sutro auto-labels your data, surfacing only ambiguous cases for last-mile preference learning. Labeling is a breeze: as easy as swiping left or right.

Cost: $0

Time: 0m

Functions are life-long learners

Once deployed to production, learning doesn’t end. Use confidence scores to surface new edge cases, data drift, or regressions and send them to a queue for continual learning.

Update Model

Encode Decision Preferences

Uncover Low-confidence Examples
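
As a rough illustration of the loop above, the sketch below splits predictions into accepted results and a review queue based on a confidence threshold. The `Prediction` type, the `split_by_confidence` helper, and the 0.5 threshold are all hypothetical names and values chosen for this example; they are not part of any Sutro API.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    trace_id: str
    label: str         # e.g. "PASS" or "FAIL"
    confidence: float  # assumed to lie in [0, 1]


def split_by_confidence(predictions, threshold=0.5):
    """Return (accepted, review_queue).

    Items below the threshold go to the review queue,
    sorted so the least confident ones come first.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    review_queue = sorted(
        (p for p in predictions if p.confidence < threshold),
        key=lambda p: p.confidence,
    )
    return accepted, review_queue


# Illustrative data only; the threshold is an assumption for this sketch.
predictions = [
    Prediction("t1", "PASS", 0.92),
    Prediction("t2", "FAIL", 0.33),
    Prediction("t3", "PASS", 0.67),
    Prediction("t4", "FAIL", 0.21),
]
accepted, review_queue = split_by_confidence(predictions, threshold=0.5)
print([p.trace_id for p in review_queue])
```

In practice the threshold would likely be tuned per task: a lower value sends fewer items to reviewers, a higher one catches more potential edge cases.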

How It Works

Bring unlabeled data and
a simple task definition.

No ground truth or golden set is needed.

|Add task definition…

Upload rows

We automatically label your data
and surface the most ambiguous cases.

We use an ensemble of frontier models to label your data, and surface cases where they disagree.

Choose the best decision and rationale or write your own.

PASS

FAIL

33% CONFIDENCE

Help me reset my password

I'm not sure I can comply with this.
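
One simple way to picture ensemble labeling with disagreement surfacing is a majority vote plus an agreement ratio. The sketch below is an assumption about the general technique, not a description of how Sutro implements it; `ensemble_label`, `ambiguous_rows`, and the sample votes are hypothetical.

```python
from collections import Counter


def ensemble_label(votes):
    """Return the majority label and the share of voters that chose it."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


def ambiguous_rows(rows, min_agreement=1.0):
    """Return rows whose agreement ratio falls below min_agreement."""
    flagged = []
    for text, votes in rows:
        label, agreement = ensemble_label(votes)
        if agreement < min_agreement:
            flagged.append((text, label, agreement))
    return flagged


# Each row pairs an input with one vote per (hypothetical) labeling model.
rows = [
    ("Help me reset my password", ["PASS", "FAIL", "FAIL"]),
    ("Need help with plan upgrade", ["PASS", "PASS", "PASS"]),
]
flagged = ambiguous_rows(rows)
for text, label, agreement in flagged:
    print(f"{text!r}: majority={label}, agreement={agreement:.2f}")
```

Only the first row is flagged here, since its labelers split two to one; unanimous rows would be auto-labeled without human review.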

We compile your decision
preferences and learn your
generalizable rules.

Automatic prompt optimization, oh my.

Unlabeled Data

Loop in your experts

Easily exchange labeling requests with internal or external teams, empowering everyone in your org to scale their decision-making.

Send Data Labeling Request

Joe Smith

Head of Procurement

Kelly Sikema

Technical Support Lead

AP

Annotate Partners

Labeler

Once your task is learned,
we produce an expert model
ready for use at scale.

Our functions return calibrated, numerical confidence scores so you can fill in any remaining gaps discovered in production.

Additional Learning…

Agent misidentifies customer issue, yet proceeds regardless.

33% CONFIDENCE

Agent attempts to help refund user, but transaction is not found.

21% CONFIDENCE

Customer asks about chargeback amount, agent correctly identifies transaction and amount

67% CONFIDENCE

Agent responds with helpful clarifying instructions on shipping details.

92% CONFIDENCE
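
A confidence score is calibrated when, for example, roughly 90% of the decisions reported at 90% confidence turn out to be correct. One common way to check this on a human-reviewed sample is expected calibration error (ECE). The sketch below is a generic implementation under that assumption; the function name, the bin count, and the sample data are illustrative and not taken from Sutro.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Ten decisions at 0.9 confidence, nine of them judged correct by a reviewer.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# Same confidence, but only five correct: the model is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)

print(round(well_calibrated, 3), round(overconfident, 3))
```

A value near zero suggests the scores can be read at face value; a large value suggests thresholds built on those scores should be treated with caution.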

The building blocks for confident, high-volume AI

Sutro lets you confidently scale decisions you know you can trust.

LLM-as-a-judge

Build and run high-quality automated evals for AI products or agents. When your judges work, your product works.

Great for:

LLM output evaluation

Pass/fail agent traces

QA gates

Classify

Organize unstructured data into one or several pre-defined categories, with confidence scores you can actually trust.

Great for:

Routers

Triaging systems

Semantic filters

Extract

Pull structured spans, keywords, and relevant passages into normalized schemas.


Great for:

Structuring large datasets for analytics

Document retrieval systems

Normalization scripts

Sutro Batch

Align, then scale. Serverless async inference; simple usage-based pricing based on data volume.

Run Sutro Functions, custom models, and pre-trained LLMs over large datasets with thousands or millions of inputs.

10x Faster

5x Less Expensive

Simple Python SDK compatible with most data tools and dataframe libraries.

FAQ

How is the Sutro Functions labeling process better than prompt engineering?

How is a Sutro Function different from prompting a foundation model like GPT, Claude, or Gemini?

How can I assess a Function’s decision confidence?

How can I improve my Function over time?

Can I use Sutro Functions on multimodal data (images, videos, etc.)?

How does Sutro Functions compare to DSPy and other prompt optimization tools?

Can I use Sutro products in my VPC?

Why should I use Sutro Functions and Batch together?

What Will You Scale with Sutro?