Stop Overpaying for Data Labelers

A platform to automate, manage, and scale data labeling workflows. Process millions of text files and documents up to 20x faster and 90% cheaper.

import sutro as so
import polars as pl
from pydantic import BaseModel

# Load 250k small-business Q&A posts from a Parquet file.
reddit_posts = pl.read_parquet("social-media-posts-small-business-250k.parquet")

system_prompt = """
You will be shown the title of a social media Q&A post (and in some cases the text and comments of the post) about small businesses.

Your job is to assign one or more semantic labels to the post to categorize the question that is being asked. This will help identify and cluster common questions and topics.

The labels should be 2-3 words long and formatted with underscore separators between the words. For example, labels might be "tax_deduction", "insurance_costs", "payroll_software", "small_business_accounting", "small_business_loans", etc.

Return the labels in a list.
"""

# Structured output: each post receives a list of label strings.
class Labels(BaseModel):
    labels: list[str]

# Run the labeling prompt over every post as a single batch job.
results = so.infer(
    reddit_posts,
    column=["Title: " + "title", "Text: " + "text", "Comments: " + "comments_list"],
    system_prompt=system_prompt,
    model="qwen-3-32b-thinking",
    output_schema=Labels,
)

# Aggregate the results by label and count how often each one appears.
results = (
    results.explode("labels")
    .group_by("labels")
    .agg(pl.count())
    .sort("count", descending=True)
)
print(results)

┌────────────────────────┬───────┐
│ labels                 ┆ count │
│ ---                    ┆ ---   │
│ str                    ┆ i64   │
╞════════════════════════╪═══════╡
│ marketing_strategy     ┆ 10378 │
│ small_business_startup ┆ 6802  │
│ pricing_strategy       ┆ 6407  │
│ social_media_marketing ┆ 5670  │
│ small_business_loans   ┆ 5472  │
└────────────────────────┴───────┘

A Smarter Way to Label Data

Accelerate Your Workflows

Get labeled data in hours, not weeks. Our platform parallelizes labeling tasks across thousands of LLM calls, with prompts purpose-built for massive datasets.

Dramatically Reduce Costs

Up to 90% cost reduction. Use cost-effective open-source models for pre-labeling or automate entire pipelines, slashing your per-label cost.

One Simple SDK

Abstract away brittle infra. Replace complex for-loops, rate limit handling, and backoffs with a few lines of code. Focus on your labeling logic, not orchestration.
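
As a rough sketch of what that looks like in practice, the snippet below reuses the so.infer call shape from the example above. The dataset path, prompt, and single-string column argument are illustrative assumptions (the example above passes a list of fields), not a prescribed setup.

import sutro as so
import polars as pl

# Load the dataset to label (illustrative path).
tickets = pl.read_parquet("support-tickets.parquet")

# One batch call replaces a hand-rolled loop with its own retry, backoff,
# and rate-limit bookkeeping. The call mirrors the labeling example above;
# the single column name passed here is an assumption.
summaries = so.infer(
    tickets,
    column="text",
    system_prompt="Summarize the support ticket in one sentence.",
    model="qwen-3-32b-thinking",
)
print(summaries)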

Scale with Zero Code Changes

Go from 1,000 samples to 10 million with the same script. Perfect for pre-labeling, automated QA, or generating labels with foundation models.

Built for Any Labeling Task

LLM Pre-labeling

Use models like Qwen3 or Gemma3 to generate initial labels for text classification, NER, or summarization tasks.

Training Data Curation

Rapidly create high-quality training data for machine learning models. Ensure data accuracy and consistency at scale.

Automated Quality Assurance

Run models to review human-labeled data, flagging inconsistencies or errors at a massive scale.
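
A minimal sketch of the flagging step with Polars, assuming model labels have already been generated and sit alongside the human labels in one table; the file path and the model_label and human_label column names are hypothetical.

import polars as pl

# Hypothetical table with one row per example: the human-assigned label
# and the label produced by a model review pass.
reviewed = pl.read_parquet("labels-with-model-review.parquet")

# Flag rows where the model disagrees with the human annotator so they
# can be routed back for a second look.
flagged = reviewed.filter(pl.col("model_label") != pl.col("human_label"))

print(f"{flagged.height} of {reviewed.height} rows flagged for re-review")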

Text Classification

Classify customer feedback, support tickets, or survey responses into predefined categories.
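
For a predefined category set, a constrained output schema keeps the model on the allowed labels. The sketch below assumes the same so.infer signature as the example above; the dataset path, column name, and category names are illustrative.

from typing import Literal

import sutro as so
import polars as pl
from pydantic import BaseModel

feedback = pl.read_parquet("customer-feedback.parquet")  # illustrative path

# Restrict the output to a fixed set of categories (assumed label set).
class FeedbackCategory(BaseModel):
    category: Literal["billing", "bug_report", "feature_request", "praise", "other"]

results = so.infer(
    feedback,
    column="text",  # assumed column name
    system_prompt="Classify the customer feedback into exactly one of the allowed categories.",
    model="qwen-3-32b-thinking",
    output_schema=FeedbackCategory,
)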

Active Learning Pipelines

Programmatically identify the highest-value data points and send them to your human labeling queue.
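
One way to pick those high-value points, sketched with Polars: keep the examples a model run was least confident about and export them for human review. The file paths and the confidence column are hypothetical; how the score is produced depends on your pre-labeling setup.

import polars as pl

# Hypothetical pre-labeled dataset with a per-row model confidence score.
prelabeled = pl.read_parquet("prelabeled-with-confidence.parquet")

# Send the least-confident 5% of rows to the human labeling queue.
threshold = prelabeled.select(pl.col("confidence").quantile(0.05)).item()
to_review = prelabeled.filter(pl.col("confidence") <= threshold)

to_review.write_parquet("human-review-queue.parquet")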

Data PII Redaction

Scan and redact sensitive information from massive text or image datasets before labeling or training.
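
A sketch of a text-only redaction pass, again reusing the so.infer signature from the example above; the prompt, output schema, dataset path, and column name are assumptions, and image redaction would need a different setup.

import sutro as so
import polars as pl
from pydantic import BaseModel

documents = pl.read_parquet("raw-documents.parquet")  # illustrative path

# Ask for the text with PII replaced by a placeholder token (assumed schema).
class Redacted(BaseModel):
    redacted_text: str

redacted = so.infer(
    documents,
    column="text",  # assumed column name
    system_prompt=(
        "Rewrite the document with all personally identifiable information "
        "(names, emails, phone numbers, addresses) replaced with [REDACTED]."
    ),
    model="qwen-3-32b-thinking",
    output_schema=Redacted,
)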

FAQ

What is Sutro?

Do I need to code to use Sutro?

How much can I save using Sutro?

How do I handle rate limits in Sutro?

Can I deploy Sutro within my VPC?

Are open-source LLMs good?

Is my data secure in Sutro?

Can I use custom models in Sutro?

How can I load data into Sutro?

How do I sign up for Sutro?

70% Faster

1B+ Tokens Per Job

10X Faster Job Processing

Stop Waiting on Data. Start Building.

Get access to Sutro and transform your data labeling pipeline.