Use Cases

Docs

Pricing

Resources

Researchers

Get Access

Extract Structured Data at Scale

A platform for running large-scale data extraction and processing workloads. Turn millions of unstructured documents, web pages, or files into clean, structured datasets up to 20x faster and 90% cheaper.

Get Started

import sutro as so
import polars as pl
from pydantic import BaseModel

# Load the clinical notes to process
clinical_notes = pl.read_csv("clinical-notes.csv")

system_prompt = """
You will be shown a clinical note written by a physician. Your job is to extract the following information from the note:
- patient name
- patient date of birth
- patient diagnosis
"""

# Schema describing the fields to extract from each note
class ClinicalNote(BaseModel):
    patient_name: str
    patient_date_of_birth: str
    patient_diagnosis: str

# Run the extraction over the notes with the chosen model
results = so.infer(
    clinical_notes,
    system_prompt=system_prompt,
    model="qwen-3-32b-thinking",
    output_schema=ClinicalNote,
)

print(results.head())

┌─────────────────────────────────┬─────────────────┬───────────────────────┬──────────────────────────────┐
│ note                            ┆ patient_name    ┆ patient_date_of_birth ┆ patient_diagnosis            │
│ ---                             ┆ ---             ┆ ---                   ┆ ---                          │
│ str                             ┆ str             ┆ str                   ┆ str                          │
╞═════════════════════════════════╪═════════════════╪═══════════════════════╪══════════════════════════════╡
│ Patient: John D. Miller, DOB 1… ┆ John D. Miller  ┆ 1984-07-12            ┆ Bacterial pneumonia          │
│ Ms. Sarah Lin, born 1991-03-04… ┆ Sarah Lin       ┆ 1991-03-04            ┆ Chronic migraine             │
│ Note: Patient James Alvarez (D… ┆ James Alvarez   ┆ 1977-11-30            ┆ Type 2 Diabetes Mellitus     │
│ Consultation for Emily Nguyen,… ┆ Emily Nguyen    ┆ 2000-05-21            ┆ Iron-deficiency anemia       │
│ Michael Roberts (born 1965-09-… ┆ Michael Roberts ┆ 1965-09-17            ┆ Right distal radius fracture │
└─────────────────────────────────┴─────────────────┴───────────────────────┴──────────────────────────────┘

From Raw Data to Production Pipeline, Faster

Process Any Unstructured Source

Transform messy, real-world data into clean, structured output. Process millions of academic papers, web pages, log files, or reports with a single API call.

Drastically Reduce Processing Costs

Up to 90% cost reduction. Our efficient job management and optimized resource allocation make large-scale data processing economically viable on any budget.

Simple SDK, No Infrastructure Hell

Forget brittle scripts. Our SDK abstracts away rate limits, retries, and parallelization, so a few lines of code replace complex loops and backoff logic.
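
For context, the hand-written plumbing described above might look something like the sketch below: a retry wrapper with exponential backoff and jitter, fanned out over a thread pool. This is an illustration only and not part of the Sutro SDK; `call_with_backoff`, `process_all`, and all of their parameters are hypothetical names chosen for this example.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_backoff(fn, arg, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(arg), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The base delay doubles each attempt; jitter spreads out retries
            # so that parallel workers do not all hit the service at once.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def process_all(fn, items, max_workers=8, **retry_kwargs):
    """Apply fn to every item in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda item: call_with_backoff(fn, item, **retry_kwargs), items)
        )
```

Even this small version has to make choices about attempt limits, delay growth, and worker counts, which is the kind of tuning the paragraph above refers to.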

Scale Without Code Changes

Run your extraction pipeline on 100 files or 100 million with the same code. Sutro is purpose-built to run performantly at any scale.

Turn the Unstructured Web into Your Database

RAG Pipeline Automation

Chunk and embed millions of documents, PDFs, and internal wikis. Build high-quality, production-ready vector stores for your RAG applications without infrastructure overhead.
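
As a rough illustration of the chunking step, the sketch below splits text into fixed-size character windows that overlap, so context is not lost at chunk boundaries. It is a minimal example under simple assumptions, not Sutro's implementation: `chunk_text` is a hypothetical helper, real pipelines often split on tokens or sentences instead of characters, and the embedding step is omitted.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap keeps more shared context between neighboring chunks at the cost of producing more chunks to embed.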

Web-Scale Data Curation

Build proprietary datasets by scraping millions of web pages. Reliably extract text, product info, or user-generated content for pre-training or fine-tuning models at production scale.

Training Data Preprocessing

Clean, parse, and structure terabytes of raw logs, chat histories, or unstructured text. Prepare massive, high-quality datasets for model training and fine-tuning.
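
To make "parse and structure raw logs" concrete, here is a small sketch that turns log lines into dictionaries with a regular expression and sets aside lines that do not match. The log layout (`timestamp LEVEL component: message`) and the names `LOG_PATTERN` and `parse_log_lines` are assumptions made for this example, not a format Sutro prescribes.

```python
import re

# Assumed layout: "<timestamp> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s*(?P<message>.*)$"
)


def parse_log_lines(lines):
    """Return (records, rejected): parsed dicts and the lines that did not match."""
    records, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
        else:
            rejected.append(line)  # keep unparseable lines for later inspection
    return records, rejected


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2024-05-01T12:00:00Z ERROR payment: card declined",
        "not a structured line",
    ]
    records, rejected = parse_log_lines(sample)
    print(records)
    print(rejected)
```

Keeping rejected lines separate, instead of silently dropping them, makes it easier to judge how much of a raw dataset a rule-based parser covers.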

Content Classification & Moderation

Run inference at scale over your entire data lake or live content streams. Automatically tag, categorize, and moderate text, images, or audio for safety, search, or analytics.
