Extract Structured Data at Scale

A platform for running large-scale data extraction and processing workloads. Turn millions of unstructured documents, web pages, or files into clean, structured datasets up to 20x faster and 90% cheaper.

import sutro as so
import polars as pl
from pydantic import BaseModel

# Load the raw clinical notes into a Polars DataFrame.
clinical_notes = pl.read_csv("clinical-notes.csv")

# Instructions the model receives alongside each note.
system_prompt = """
You will be shown a clinical note written by a physician. Your job is to extract the following information from the note:
- patient name
- patient date of birth
- patient diagnosis
"""

# Schema describing the structured fields to extract.
class ClinicalNote(BaseModel):
    patient_name: str
    patient_date_of_birth: str
    patient_diagnosis: str

# Run the extraction with the chosen model and output schema.
results = so.infer(
    clinical_notes,
    system_prompt=system_prompt,
    model="qwen-3-32b-thinking",
    output_schema=ClinicalNote,
)

print(results.head())

┌─────────────────────────────────┬─────────────────┬───────────────────────┬──────────────────────────────┐
│ note                            ┆ patient_name    ┆ patient_date_of_birth ┆ patient_diagnosis            │
│ ---                             ┆ ---             ┆ ---                   ┆ ---                          │
│ str                             ┆ str             ┆ str                   ┆ str                          │
╞═════════════════════════════════╪═════════════════╪═══════════════════════╪══════════════════════════════╡
│ Patient: John D. Miller, DOB 1… ┆ John D. Miller  ┆ 1984-07-12            ┆ Bacterial pneumonia          │
│ Ms. Sarah Lin, born 1991-03-04… ┆ Sarah Lin       ┆ 1991-03-04            ┆ Chronic migraine             │
│ Note: Patient James Alvarez (D… ┆ James Alvarez   ┆ 1977-11-30            ┆ Type 2 Diabetes Mellitus     │
│ Consultation for Emily Nguyen,… ┆ Emily Nguyen    ┆ 2000-05-21            ┆ Iron-deficiency anemia       │
│ Michael Roberts (born 1965-09-… ┆ Michael Roberts ┆ 1965-09-17            ┆ Right distal radius fracture │
└─────────────────────────────────┴─────────────────┴───────────────────────┴──────────────────────────────┘
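
The schema in the example types `patient_date_of_birth` as a plain `str`, so a downstream check can help catch malformed dates. The sketch below is an illustration only: `parse_dob` is a hypothetical helper, not part of the Sutro SDK, and it assumes the model returns ISO 8601 dates (YYYY-MM-DD) as in the sample output.

```python
from datetime import date
from typing import Optional


def parse_dob(value: str) -> Optional[date]:
    """Return a date for an ISO 8601 string (YYYY-MM-DD), or None if it does not parse."""
    try:
        return date.fromisoformat(value.strip())
    except ValueError:
        return None


# Two values from the sample output, plus one deliberately bad entry.
dobs = ["1984-07-12", "1991-03-04", "not recorded"]
parsed = [parse_dob(d) for d in dobs]

# Collect the raw strings that failed to parse so they can be reviewed.
invalid = [d for d, p in zip(dobs, parsed) if p is None]
print(invalid)
```

Rows that land in `invalid` could be routed to manual review or re-run with a stricter prompt.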

From Raw Data to Production Pipeline, Faster

Process Any Unstructured Source

Transform messy, real-world data into clean, structured output. Process millions of academic papers, web pages, log files, or reports with a single API call.

Drastically Reduce Processing Costs

Up to 90% cost reduction. Our efficient job management and optimized resource allocation make large-scale data processing economically viable on any budget.

Simple SDK, No Infrastructure Hell

Forget brittle scripts. Our SDK handles rate limits, retries with backoff, and parallelization for you, so a few lines of code replace hand-written loops and retry logic.

Scale Without Code Changes

Run your extraction pipeline on 100 files or 100 million with the same code. Sutro is purpose-built to run performantly at any scale.

Turn the Unstructured Web into Your Database

RAG Pipeline Automation

Chunk and embed millions of documents, PDFs, and internal wikis. Build high-quality, production-ready vector stores for your RAG applications without infrastructure overhead.
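
As a rough illustration of the chunking step, the sketch below splits text into fixed-size, overlapping character windows. The chunking strategy, the `chunk_text` name, and the default sizes are assumptions made for this example; the page does not say how Sutro chunks documents, and the embedding step is left out.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of some duplicated text.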

Web-Scale Data Curation

Build proprietary datasets by scraping millions of web pages. Reliably extract text, product info, or user-generated content for pre-training or fine-tuning models at production scale.
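
Before any model sees a scraped page, the visible text usually has to be separated from markup. The sketch below shows one way to do that with Python's standard `html.parser`; `TextExtractor` and `extract_text` are hypothetical names for this example and are not part of the Sutro SDK.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page's visible text as a single space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Widget</h1><p>Price: $9.99</p><script>track()</script></body></html>"
)
print(extract_text(page))
```

The cleaned text can then be loaded into a DataFrame column and passed to an extraction job like the clinical-notes example.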

Training Data Preprocessing

Clean, parse, and structure terabytes of raw logs, chat histories, or unstructured text. Prepare massive, high-quality datasets for model training and fine-tuning.
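
For logs that follow a regular layout, a deterministic parser is often a sensible first pass before any model-based step. The sketch below assumes a hypothetical `timestamp LEVEL message` line format; `LOG_PATTERN`, `LogRecord`, and `parse_logs` are illustrative names, not Sutro API.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator

# Assumed line format: "2024-01-01T12:00:00 LEVEL message text"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


@dataclass
class LogRecord:
    timestamp: str
    level: str
    message: str


def parse_logs(lines: Iterable[str]) -> Iterator[LogRecord]:
    """Yield a LogRecord for each line that matches; silently skip the rest."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            yield LogRecord(**match.groupdict())


raw = [
    "2024-01-01T12:00:00 INFO  job started",
    "garbage line",
    "2024-01-01T12:00:05 ERROR upstream timeout",
]
records = list(parse_logs(raw))
print(records)
```

Lines that fail the pattern are the natural candidates for model-based extraction, since they are the ones a fixed rule cannot handle.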

Content Classification & Moderation

Run inference at scale over your entire data lake. Automatically tag, categorize, and moderate text for safety, search, or analytics.
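
Classification output is easier to aggregate when free-text model responses are mapped onto a closed label set. The sketch below is a minimal illustration: the `Label` values and the `normalize_label` helper are hypothetical and not taken from Sutro.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Label(str, Enum):
    """Hypothetical moderation labels for this example."""
    SAFE = "safe"
    SPAM = "spam"
    ABUSE = "abuse"


def normalize_label(raw: str) -> Optional[Label]:
    """Map a raw model response onto a Label, or None if it is not recognized."""
    try:
        return Label(raw.strip().lower())
    except ValueError:
        return None


# Simulated model responses, including one outside the label set.
outputs = ["Safe", " spam ", "unknown", "safe"]
labels = [normalize_label(o) for o in outputs]

# Count recognized labels only; unrecognized responses are dropped here.
counts = Counter(label.value for label in labels if label is not None)
print(counts)
```

In practice the same constraint could be expressed in the output schema itself, so that off-list labels are rejected at generation time rather than filtered afterwards.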

FAQ

What is Sutro?

Do I need to code to use Sutro?

How much can I save using Sutro?

How do I handle rate limits in Sutro?

Can I deploy Sutro within my VPC?

Are open-source LLMs good?

Is my data secure in Sutro?

Can I use custom models in Sutro?

How can I load data into Sutro?

How do I sign up for Sutro?

70% Lower Costs

1B+ Tokens Per Job

10X Faster Job Processing

Start Analyzing Unstructured Data

Stop wasting time on infrastructure and start analyzing your data. Get access to Sutro and transform your data extraction workflow.