Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
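
As a hedged illustration of how a batch like this might be expressed in code, the sketch below reuses the so.infer call and Pydantic schema pattern shown further down this page; the QAPair model, the doc_chunks list, and the exact prompt wording are assumptions made for the example, not a documented API.

import sutro as so
from pydantic import BaseModel

# Illustrative schema: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical list of documentation chunks to process in one batch
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# One structured question/answer pair comes back for every input chunk
qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)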

Document summarization

Summarize Millions of Documents for a Fraction of the Cost

Process massive amounts of free-form text into concise, analytics-ready summaries. Sutro runs LLM batch jobs in hours, not days, so you can unlock insights from your documents without managing complex infrastructure.
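
As a rough sketch of what a summarization batch could look like, assuming the same so.infer interface and Pydantic schema pattern shown below, with the DocSummary model and documents list as illustrative stand-ins for your own data:

import sutro as so
from pydantic import BaseModel

# Illustrative schema: one concise, analytics-ready summary per document
class DocSummary(BaseModel):
    summary: str

# Hypothetical corpus of free-form documents to summarize in one batch
documents = [
    "First long support transcript...",
    "Second long support transcript...",
]

system_prompt = 'Summarize the document in three sentences or fewer.'

summaries = so.infer(documents, system_prompt, output_schema=DocSummary)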

From Raw Text to Production Summaries

Sutro simplifies the entire workflow for large-scale document summarization, from initial testing to full deployment in your existing data stack.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input dataset of user reviews (e.g. user_reviews.csv)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k
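
In this sketch, each record in results would follow the ReviewClassifier schema, so the classified output can be loaded straight into a dataframe or warehouse table for downstream analysis.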

Rapidly Prototype

Start small and iterate fast on your summarization workflows. Accelerate experiments by testing on a sample of your documents before committing to large jobs.

Scale Effortlessly

Scale your LLM workflows so your team can do more in less time. Process billions of tokens in hours with no infrastructure headaches or exploding costs.

Integrate Seamlessly

Connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools such as Airflow and Dagster, as sketched below.
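
As a minimal sketch of that kind of integration, the example below wraps the same so.infer call in an Airflow TaskFlow DAG; the DAG name, schedule, and input file are illustrative assumptions rather than a documented recipe, and an equivalent Dagster asset would follow the same shape.

from airflow.decorators import dag, task
import pendulum

import sutro as so
from pydantic import BaseModel

# Same illustrative schema as the classification example above
class ReviewClassifier(BaseModel):
    sentiment: str

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def classify_reviews_daily():
    @task
    def run_batch():
        # Hypothetical input produced by an upstream extraction task
        user_reviews = 'user_reviews.csv'
        system_prompt = 'Classify the review as positive, neutral, or negative.'
        # Kick off the batch job and hand the results to downstream tasks
        return so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

    run_batch()

classify_reviews_daily()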

Get Results in Hours, Not Days

Shorten development cycles and get feedback faster. Sutro is built to turn massive summarization jobs that would take days into tasks that complete in a fraction of the time.

Reduce Costs by 10x or More

Get results faster and reduce costs significantly by parallelizing your LLM calls through Sutro. Process billions of tokens without the exploding costs.

Scale Without Infrastructure Headaches

Confidently handle millions of requests at a time. Easily sift through thousands of product reviews or historical notes without the pain of managing your own infrastructure.

Structured Extraction

Turn free-form text into structured, schema-validated fields you can query directly, using Pydantic output schemas like the example above.

RAG data preparation

Easily convert large corpuses of free-form text into vector representations for semantic search and recommendations.

Sentiment analysis

Automatically organize your data into meaningful categories without involving your ML engineer.

Unstructured ETL

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

LLM performance evaluation

Benchmark your LLM outputs to continuously improve workflows, agents and assistants, or easily evaluate custom models against a new use-case.

Product insight mining

Easily sift through thousands of product reviews and unlock valuable product insights while brewing your morning coffee.

FAQ

What is Sutro?

How does Sutro help reduce costs?

What kind of tasks can I run on Sutro?

Do I need to manage my own infrastructure?

How do I integrate Sutro into my existing tools?

What Will You Scale with Sutro?