Stop Overpaying for Data Labelers
A platform to automate, manage, and scale data labeling workflows. Process millions of text files and documents up to 20x faster and 90% cheaper.
import sutro as so
import polars as pl
from pydantic import BaseModel
reddit_posts = pl.read_parquet("social-media-posts-small-business-250k.parquet")
system_prompt = """
You will be shown the title of a social media Q&A post (and in some cases the text and comments of the post) about small businesses.
Your job is to assign one or more semantic labels to the post to categorize the question that is being asked. This will help identify and cluster common questions and topics. 
The labels should be 2-3 words long, with the words separated by underscores. Example labels: "tax_deduction", "insurance_costs", "payroll_software", "small_business_accounting", "small_business_loans", etc.
Return the labels in a list.
"""
class Labels(BaseModel):
    labels: list[str]

results = so.infer(
    reddit_posts,
    column=["Title: " + "title", "Text: " + "text", "Comments: " + "comments_list"],
    system_prompt=system_prompt,
    model="qwen-3-32b-thinking",
    output_schema=Labels,
)
# aggregate the results by label
results = results.explode("labels").group_by("labels").agg(pl.len().alias("count")).sort("count", descending=True)
print(results)
┌────────────────────────┬───────┐
│ labels                 ┆ count │
│ ---                    ┆ ---   │
│ str                    ┆ i64   │
╞════════════════════════╪═══════╡
│ marketing_strategy     ┆ 10378 │
│ small_business_startup ┆ 6802  │
│ pricing_strategy       ┆ 6407  │
│ social_media_marketing ┆ 5670  │
│ small_business_loans   ┆ 5472  │
└────────────────────────┴───────┘
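The prompt above asks for labels of 2-3 words joined by underscores, but a model may not always comply. A minimal sketch of a format check you could run on the returned labels before aggregating follows. The helper names `is_valid_label` and `split_labels` are hypothetical and not part of the Sutro SDK; the pattern assumes lowercase alphanumeric words.

```python
import re

# Assumption: a valid label is two or three lowercase alphanumeric words
# joined by single underscores, e.g. "tax_deduction".
LABEL_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+){1,2}")


def is_valid_label(label: str) -> bool:
    """Return True if the label matches the 2-3 word underscore format."""
    return LABEL_PATTERN.fullmatch(label) is not None


def split_labels(labels: list[str]) -> tuple[list[str], list[str]]:
    """Partition labels into (valid, rejected), preserving input order."""
    valid = [label for label in labels if is_valid_label(label)]
    rejected = [label for label in labels if not is_valid_label(label)]
    return valid, rejected


valid, rejected = split_labels(
    ["tax_deduction", "small_business_loans", "Marketing", "how_to_start_a_business"]
)
print(valid)
print(rejected)
```

Rejected labels can be logged and reviewed, or used to decide whether the prompt needs tightening.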
Accelerate Your Workflows
Get labeled data in hours, not weeks. Our platform parallelizes labeling tasks across thousands of calls to LLMs with purpose-built prompts for massive datasets.
Dramatically Reduce Costs
Up to 90% cost reduction. Use cost-effective open-source models for pre-labeling or automate entire pipelines, slashing your per-label cost.
One Simple SDK
Abstract away brittle infra. Replace complex for-loops, rate limit handling, and backoffs with a few lines of code. Focus on your labeling logic, not orchestration.
Scale with Zero Code Changes
Go from 1,000 samples to 10 million with the same script. Perfect for pre-labeling, automated QA, or generating labels with foundation models.
70%
Faster
1B+
10X
Faster Job Processing
Stop Waiting on Data. Start Building.
Get access to Sutro and transform your data labeling pipeline.