Use test-driven development principles for your model, AI application, or agent performance at all stages - from initial training, to ongoing monitoring, to retroactive analytics.
import sutro as so
import polars as pl
from pydantic import BaseModel
df = pl.read_csv('customer-support-dialogues-20k.csv')
system_prompt = "You are a judge. You will be shown a dialogue between a customer and a customer support agent. Your job is to evaluate the chatbot's helpfulness to the user. Return a score between 0 and 10."
class Evaluation(BaseModel):
score: int
results = so.infer(
df,
column='customer_support_dialogue',
model='gpt-oss-120b',
system_prompt=system_prompt,
output_schema=Evaluation
)
results = so.await_job_completion(results, with_original_df=df)
print(results.head())
Get Results in Minutes, Not Days
From simple result scoring to complex judge sweeps, pairwise comparisons, ensemble ranking, backlog evals, drift detection, red-teaming and more. Our platform handles rate limits, retries, and job parallelization.
Reduce Eval Costs by 90%
Optimized job packing for high-throughput inference. Stop overpaying for real-time results. Make comprehensive evals affordable, ongoing, and a first-class citizen in your AI development process.
Focus on Metrics, Not Infra
A simple SDK to submit eval jobs. No API rate limiting. No retries. Go from a notebook script to a massive-scale eval with the same code.
Iteratively Improve Models, AI Apps, and Agents Offline
If you’re training LLMs, building AI apps, or developing agents - you’ll need a way to evaluate their performance.
Read Guide
Testimonials

Charlie Snell
UC Berkley Researcher


Pairwise Comparison
Run side-by-side evals to determine which of two model outputs is superior.
Single Answer Scoring
Evaluate a single model output against a rubric or reference answer on a 1-10 scale.
Safety & Bias Testing
Test models against millions of prompts to find edge cases, hallucinations, or biased responses.
Chain-of-Thought (CoT) Evals
Judge the reasoning process of a model, not just the final answer.
Agent Trajectory Evals
Simulate and judge sweeps of end-to-end agentic processes in one call.
Function-Calling Evals
Validate the accuracy and format of generated function calls and JSON outputs.
FAQ
70%
Lower Costs
1B+
Tokens Per Job
10X
Faster Job Processing
Run Comprehensive Evals At Scale
Stop spot-checking. Rigorously evaluate your models.

