Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating …

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
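
As a rough illustration of how a job like this might be expressed with Sutro's Python SDK, here is a minimal sketch that reuses the so.infer call shown later on this page. The QAPair schema and the documentation_chunks list are hypothetical, not part of the demo above.

import sutro as so
from pydantic import BaseModel

# Hypothetical structured output for one generated question/answer pair.
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical list of documentation chunks, like the vLLM excerpts shown above.
documentation_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = "Generate a question/answer pair for the following chunk of vLLM documentation."

# One batch call produces a QAPair for every chunk.
results = so.infer(documentation_chunks, system_prompt, output_schema=QAPair)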

Bulk content generation

Scale Your Content Engine, Not Your Costs

Generate millions of unique content pieces, from product descriptions to personalized emails, without the infrastructure headaches. Sutro runs LLM batch jobs in hours, not days, at a fraction of the cost.

From Idea to Millions of Requests, Simplified

Sutro takes the pain out of testing and scaling LLM batch jobs, unblocking your most ambitious content projects.

import sutro as so
from pydantic import BaseModel

# Structured output schema for each classified review.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data to classify, e.g. the user_reviews.csv file from the demo.
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k
█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Rapidly Prototype

Start small and iterate fast on your content generation workflows. Accelerate experiments by testing on Sutro before committing to large jobs.
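
For instance, one way to prototype is to run the same call on a small in-memory sample before pointing it at the full dataset. The snippet below is a rough sketch, not Sutro's documented workflow: it reuses the so.infer call and ReviewClassifier schema from the example above, and it assumes so.infer also accepts a plain Python list of inputs.

import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

# Hypothetical hand-picked sample for a quick prompt-quality check.
sample_reviews = [
    "Great product, arrived quickly and works as described.",
    "The app keeps crashing on startup. Very frustrating.",
]

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Prototype on a handful of rows first...
sample_results = so.infer(sample_reviews, system_prompt, output_schema=ReviewClassifier)

# ...then point the same call at the full dataset once the prompt looks right.
full_results = so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)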

Scale Effortlessly

Scale your LLM workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate Seamlessly

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
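
As an illustration, a Sutro batch job can run as one step of an orchestration pipeline. The sketch below is illustrative rather than an official integration: it wraps the so.infer call from the example above in a Dagster asset (an Airflow DAG could wrap it in a task the same way), and the asset name and file path are hypothetical.

import sutro as so
from dagster import asset
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

@asset
def review_sentiment():
    # Run the batch classification job as a materialized asset in a Dagster pipeline.
    system_prompt = 'Classify the review as positive, neutral, or negative.'
    return so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)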

Create Content at Scale

Confidently handle millions of requests and billions of tokens at a time. Go from generating a handful of articles to creating personalized content for millions of users without managing infrastructure.

Reduce Costs by 10x or More

Get results faster and dramatically lower your expenses. Sutro reduces costs by parallelizing your LLM calls, making large-scale content generation financially viable.

Go From Idea to Published in Hours

Shorten your content development cycles. Get feedback from large batch jobs in minutes and process entire campaigns in hours, not days, to move faster than the competition.

Personalized email generation

Generate millions of unique, personalized emails tailored to each recipient's own data, without hand-writing a single one.

Product description generation

Enrich your messy product catalog data without involving your machine learning engineer; a sketch of this use case follows these examples.

SEO meta description generation

Crawl millions of web pages and generate unique, structured metadata to improve search visibility and performance.

Synthetic data generation

Generate high-quality, diverse, and representative synthetic data to improve model or RAG retrieval performance.

Documentation generation

Transform unstructured notes and source materials into analytics-ready, structured documentation for any product.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets for your company or your customers.
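
As a concrete sketch of the product description use case above, the snippet below follows the same pattern as the classification example earlier on this page. It is illustrative only: the ProductDescription schema and the product_catalog.csv file are hypothetical, and only the so.infer call mirrors the documented example.

import sutro as so
from pydantic import BaseModel

# Hypothetical structured output for one enriched catalog entry.
class ProductDescription(BaseModel):
    title: str
    description: str
    key_features: list[str]

system_prompt = (
    'Write a concise, accurate product description from the raw catalog fields, '
    'and list the key features.'
)

# Hypothetical input file of raw catalog rows.
results = so.infer('product_catalog.csv', system_prompt, output_schema=ProductDescription)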

FAQ

What is Sutro?

How does Sutro reduce costs?

How do I use Sutro?

How quickly can Sutro process jobs?

What tools does Sutro integrate with?

What Will You Scale with Sutro?