Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
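
A job like the one above takes only a few lines with Sutro's Python SDK (shown in full further down the page). Here is a minimal sketch, assuming an illustrative QAPair schema and a hypothetical list of documentation chunks:

import sutro as so
from pydantic import BaseModel

# Illustrative output schema: one question/answer pair per documentation chunk.
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical list of vLLM documentation chunks to fan out over.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)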

Content analysis

Unlock insights from massive content libraries in hours

Easily sift through thousands of product reviews, crawl millions of web pages, or analyze your entire content corpus to extract structured insights that drive business decisions.

From Idea to Insights, Simplified

Sutro takes the pain out of testing and scaling your content analysis workflows, unblocking your most ambitious AI projects.

import sutro as so
from pydantic import BaseModel

# Define the structured output you want back for each review.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input files containing the raw reviews.
user_reviews = [
    'user_reviews.csv',
    'user_reviews-1.csv',
    'user_reviews-2.csv',
    'user_reviews-3.csv',
]

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Rapidly Prototype

Start small and iterate fast on your analysis prompts and schemas. Shorten development cycles by getting feedback from large batch jobs in minutes before committing to your entire dataset.
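
As a rough sketch of that loop, reusing the names from the example above, you might point the same call at a single file before fanning out to all of them:

import sutro as so

# Assumes user_reviews, system_prompt, and ReviewClassifier from the example above.
# Prototype against one file to get quick feedback on the prompt and schema.
draft = so.infer(user_reviews[:1], system_prompt, output_schema=ReviewClassifier)

# Once the outputs look right, run the identical call over the full set of files.
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)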

Scale

Scale your content analysis workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing data workflows. Sutro's Python SDK is compatible with popular data orchestration tools, object storage, and notebooks.
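
As one sketch of what that can look like, the example below reads reviews with pandas (from a local file or an object-storage path), runs them through Sutro, and writes the structured results back out. The file name, column names, and the shape of the returned results are assumptions for illustration:

import pandas as pd
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

# Hypothetical source file and column name; an s3:// path works the same way with pandas.
df = pd.read_csv('user_reviews.csv')
reviews = df['review_text'].tolist()

system_prompt = 'Classify the review as positive, neutral, or negative.'
results = so.infer(reviews, system_prompt, output_schema=ReviewClassifier)

# Assumes results come back aligned with the inputs; join them onto the original rows.
df['sentiment'] = [r.sentiment for r in results]
df.to_parquet('user_reviews_classified.parquet')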

Scale your analysis effortlessly

Confidently process millions of documents and billions of tokens at a time. Sutro handles the infrastructure so you can focus on the insights, not the overhead.

Reduce analysis costs by 10x or more

Get results faster and dramatically reduce costs. Sutro parallelizes your LLM calls, transforming expensive, time-consuming analysis into an efficient, affordable workflow.

From raw text to structured data, simplified

Use a simple Python SDK to transform unstructured data into analytics-ready datasets. Automatically classify, extract, and summarize without involving your ML engineers.
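
For example, classification, extraction, and summarization can be combined in a single schema. The fields below are illustrative, and user_reviews is the input list from the example above:

import sutro as so
from pydantic import BaseModel

# Illustrative schema combining classification, extraction, and summarization.
class ReviewInsights(BaseModel):
    sentiment: str
    product_mentions: list[str]
    summary: str

system_prompt = (
    'Classify the sentiment, list any products mentioned, '
    'and summarize the review in one sentence.'
)

results = so.infer(user_reviews, system_prompt, output_schema=ReviewInsights)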

Structured Extraction

Turn free-form text into schema-validated, analytics-ready records using simple Pydantic models.

Sentiment analysis

Automatically organize your data into meaningful categories without involving your ML engineer.

Product insight mining

Easily sift through thousands of product reviews and unlock valuable product insights while brewing your morning coffee.

Document summarization

Easily convert large corpuses of free-form text into concise summaries for semantic search and recommendations.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets for your company or your customers.

Unstructured ETL

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

FAQ

What is Sutro?

How does Sutro save costs?

What kind of tasks can I perform with Sutro?

Do I need to manage my own infrastructure?

How do I use Sutro?

What Will You Analyze with Sutro?