Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
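
Expressed in code, the job above could look something like the minimal sketch below. It assumes the so.infer call shown in the example further down this page; the QAPair schema, the inlined documentation chunks, and the assumption that so.infer accepts a list of input strings are illustrative, not documented Sutro API.

import sutro as so
from pydantic import BaseModel

# Illustrative schema for the question/answer pair generated from each chunk.
class QAPair(BaseModel):
    question: str
    answer: str

# A couple of documentation chunks, inlined for the sketch; in practice these
# would come from your own chunking pipeline.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# Assumes so.infer accepts a list of inputs, mirroring the example below.
qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)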

Synthetic data generation

Generate massive, high-quality datasets in hours, not days

Improve your LLM or RAG performance with diverse and representative synthetic data. Sutro takes the pain away from generating data at scale, helping you fill statistical gaps without complex infrastructure or exploding costs.


From Idea to Millions of Requests, Simplified

Sutro simplifies the entire batch workflow, letting you start small, test your data generation prompts, and scale up to massive jobs with ease.

import sutro as so
from pydantic import BaseModel

# Define the structured output you want back for each review.
class ReviewClassifier(BaseModel):
    sentiment: str

# Point the job at the input file shown in the demo.
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Run the whole file as a single batch job and get structured results back.
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your data generation prompts. Accelerate experiments by testing on Sutro with a small batch before committing to a large job, and get feedback in minutes.

Scale

Scale your synthetic data generation to process billions of tokens in hours, not days. Sutro handles the infrastructure so your team can do more in less time without exploding costs.
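
As a rough sketch of that prototype-then-scale loop, you might run a small sample first and only then submit the full file. The use of pandas, the review column name, and the assumption that so.infer also accepts a list of strings are illustrative; the full-scale call simply mirrors the example above.

import pandas as pd
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Prototype: run a small sample first to sanity-check the prompt and schema.
reviews = pd.read_csv('user_reviews.csv')['review'].tolist()
sample_results = so.infer(reviews[:100], system_prompt, output_schema=ReviewClassifier)

# Scale: once the outputs look right, submit the whole file as one batch job.
full_results = so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)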

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
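
For example, a Sutro batch job can be wrapped in an orchestrator task. The sketch below uses Airflow's standard @dag/@task decorators; the DAG name, schedule, and file path are illustrative, and the Sutro call mirrors the example above rather than documenting additional API.

from airflow.decorators import dag, task
import pendulum

@dag(schedule='@daily', start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def classify_reviews_batch():
    @task
    def run_sutro_batch():
        # Imports inside the task keep the DAG file light to parse.
        import sutro as so
        from pydantic import BaseModel

        class ReviewClassifier(BaseModel):
            sentiment: str

        results = so.infer(
            'user_reviews.csv',
            'Classify the review as positive, neutral, or negative.',
            output_schema=ReviewClassifier,
        )
        # Downstream tasks would typically persist these results to storage;
        # returning a row count here (assuming results is list-like) keeps the
        # XCom payload small.
        return len(results)

    run_sutro_batch()

classify_reviews_batch()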

Improve Model Performance

Generate high-quality, diverse, and representative synthetic data to fill statistical gaps and improve your LLM or RAG retrieval performance, without the complexity.

Reduce Costs by 10x or More

Get results faster and dramatically lower your expenses. Sutro parallelizes your LLM calls to generate data at a fraction of the cost of other methods.

Scale Effortlessly

Go from idea to millions of data points without the pain of managing infrastructure. Confidently handle billions of tokens at a time for your most ambitious AI projects.

RAG data preparation

Prepare large document collections for retrieval-augmented generation, turning raw text into clean, query-ready context at scale.

LLM performance evaluation

Benchmark your LLM outputs to continuously improve workflows, agents and assistants, or easily evaluate custom models against a new use-case.

Feature engineering

Enrich your data with meaningful labels to improve model training and data preparation.

Structured extraction

Transform unstructured data into structured insights that drive business decisions.

Sentiment analysis

Automatically organize your data into meaningful categories without involving your ML engineer.

Document summarization

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.


FAQ

What is Sutro?

How does Sutro reduce costs?

What kind of scale can Sutro handle?

Does Sutro work with my existing tools?

What are Sutro's core capabilities?

What Will You Scale with Sutro?