LLM-as-a-Judge
Evals at Any Scale

Use test-driven development principles to measure your model, AI application, or agent performance at every stage - from initial training, to ongoing monitoring, to retroactive analytics.

import sutro as so
import polars as pl
from pydantic import BaseModel

# Load the dialogues to be judged
df = pl.read_csv('customer-support-dialogues-20k.csv')

system_prompt = "You are a judge. You will be shown a dialogue between a customer and a customer support agent. Your job is to evaluate the chatbot's helpfulness to the user. Return a score between 0 and 10."

# Structured output: each dialogue gets a single integer score
class Evaluation(BaseModel):
    score: int

# Submit the batch inference job
results = so.infer(
    df,
    column='customer_support_dialogue',
    model='gpt-oss-120b',
    system_prompt=system_prompt,
    output_schema=Evaluation
)

# Wait for the job to finish and join scores back onto the original dataframe
results = so.await_job_completion(results, with_original_df=df)

print(results.head())

┌─────┬─────────────────────────────────────────────────────┬───────┐
│ id  ┆ dialogue                                            ┆ score │
│ --- ┆ ---                                                 ┆ ---   │
│ i64 ┆ str                                                 ┆ i64   │
╞═════╪═════════════════════════════════════════════════════╪═══════╡
│ 0   ┆ Customer Support Agent: Hello, thank you for reach… ┆ 7     │
│ 1   ┆ Customer Support Agent: Hello! Thank you for reach… ┆ 9     │
│ 2   ┆ Maya Chen: Hi there, I'm reaching out about the RP… ┆ 9     │
│ 3   ┆ Customer Support Agent: Hello! Thank you for reach… ┆ 3     │
│ 4   ┆ Customer Support Agent: Hello Morgan! Thank you so… ┆ 10    │
└─────┴─────────────────────────────────────────────────────┴───────┘
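
Because the scores come back joined onto an ordinary polars dataframe, they can be aggregated directly for reporting or regression checks. A minimal follow-up sketch, assuming the `score` column shown above:

# Summarize judge scores across the dataset (standard polars expressions)
summary = results.select(
    pl.col('score').mean().alias('mean_score'),
    pl.col('score').quantile(0.1).alias('p10_score'),
)
print(summary)

# Pull the lowest-scoring dialogues for manual review
worst = results.sort('score').head(20)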

Run Evals at Scale. Instantly.

Get Results in Minutes, Not Days

Run everything from simple result scoring to complex judge sweeps, pairwise comparisons, ensemble ranking, backlog evals, drift detection, red-teaming, and more. Our platform handles rate limits, retries, and job parallelization.

Reduce Eval Costs by 90%

Optimized job packing for high-throughput inference. Stop overpaying for real-time results. Make comprehensive evals affordable, ongoing, and a first-class citizen in your AI development process.

Focus on Metrics, Not Infra

A simple SDK to submit eval jobs. No API rate limits to manage. No manual retries. Go from a notebook script to a massive-scale eval with the same code.

Iteratively Improve Models, AI Apps, and Agents Offline

If you’re training LLMs, building AI apps, or developing agents, you’ll need a way to evaluate their performance.

Read Guide

Testimonials

Sutro’s batch inference was enormously helpful for some of my research. They had no problem scaling to my very large workload, and delivered the best service at the lowest price available.

Charlie Snell

UC Berkeley Researcher

Sutro lets our researchers fire off batch inference—whether it’s a thousand samples or a few billion—through one API call. They don’t have to check cluster queues or negotiate priorities; the job runs immediately with a predictable, fast return-time.

Nathan Lile

CEO, Synthlabs

From Simple Benchmarks to Complex Evals

Pairwise Comparison

Run side-by-side evals to determine which of two model outputs is superior; a sketch follows this list.

Single Answer Scoring

Evaluate a single model output against a rubric or reference answer on a 1-10 scale.

Safety & Bias Testing

Test models against millions of prompts to find edge cases, hallucinations, or biased responses.

Chain-of-Thought (CoT) Evals

Judge the reasoning process of a model, not just the final answer.

Agent Trajectory Evals

Simulate and judge sweeps of end-to-end agentic processes in one call.

Function-Calling Evals

Validate the accuracy and format of generated function calls and JSON outputs.
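
For the pairwise case referenced above, one way to set it up is to reuse the same `so.infer` pattern from the example at the top of the page. The file name, column names, and rubric below are hypothetical and shown only as a sketch:

import sutro as so
import polars as pl
from pydantic import BaseModel

# Hypothetical input: one prompt and two candidate answers per row
df = pl.read_csv('pairwise-candidates.csv')  # assumed columns: prompt, answer_a, answer_b

# Fold both candidates into a single text column for the judge to read
df = df.with_columns(
    pl.format(
        "Prompt: {}\n\nAnswer A: {}\n\nAnswer B: {}",
        pl.col('prompt'), pl.col('answer_a'), pl.col('answer_b')
    ).alias('comparison')
)

system_prompt = "You are a judge. You will be shown a prompt and two candidate answers, A and B. Decide which answer is better and briefly explain why."

# Structured verdict for each comparison
class PairwiseVerdict(BaseModel):
    winner: str  # 'A' or 'B'
    rationale: str

results = so.infer(
    df,
    column='comparison',
    model='gpt-oss-120b',
    system_prompt=system_prompt,
    output_schema=PairwiseVerdict
)
results = so.await_job_completion(results, with_original_df=df)

Ensemble ranking, CoT evals, and agent trajectory evals follow the same shape: build the text to be judged into one column, define an output schema, and submit the batch.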

FAQ

What is Sutro?

Do I need to code to use Sutro?

How much can I save using Sutro?

How do I handle rate limits in Sutro?

Can I deploy Sutro within my VPC?

Are open-source LLMs good?

Is my data secure in Sutro?

Can I use custom models in Sutro?

How can I load data into Sutro?

How do I sign up for Sutro?

70%

Lower Costs

1B+

Tokens Per Job

10X

Faster Job Processing

Run Comprehensive Evals At Scale

Stop spot-checking. Rigorously evaluate your models.