Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating...


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
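In code, this synthetic Q&A generation job comes down to a few lines of Python. The sketch below reuses the so.infer(inputs, system_prompt, output_schema=...) call shown in the classification example further down this page; the QAPair schema and the doc_chunks list are illustrative assumptions, not a verbatim part of Sutro's API.

import sutro as so
from pydantic import BaseModel

# Desired output structure for each documentation chunk.
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical inputs: chunks of vLLM documentation like the ones shown above.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving. ...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. ...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)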

Text classification

Automatically classify millions of text records in hours, not days

Transform massive amounts of free-form text into structured, organized categories. Run large-scale classification jobs on product reviews, support tickets, or user feedback with a few lines of Python and get results 10x faster and cheaper.


A simple, powerful workflow for text classification

Sutro takes the pain out of testing and scaling LLM batch classification jobs, unblocking your most ambitious AI projects.

import sutro as so
from pydantic import BaseModel

# Output schema: each review gets a single sentiment label.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file selected in the demo (other batches: User_reviews-1.csv, User_reviews-2.csv, User_reviews-3.csv).
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k
█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your classification prompts and schemas. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale

Scale your classification workflow so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing data workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
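As one illustration of that integration, here is a hedged sketch of wrapping the classification job in an Airflow DAG using the TaskFlow API. The DAG name, schedule, and input file are assumptions made for the example; only the so.infer call mirrors the snippet above.

from datetime import datetime

from airflow.decorators import dag, task
from pydantic import BaseModel
import sutro as so

class ReviewClassifier(BaseModel):
    sentiment: str

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def classify_reviews():
    @task
    def run_batch_classification():
        # Submit the batch job; results come back once the run completes.
        return so.infer(
            'User_reviews.csv',
            'Classify the review as positive, neutral, or negative.',
            output_schema=ReviewClassifier,
        )

    run_batch_classification()

classify_reviews()

A Dagster asset would look much the same: the Sutro call is plain Python, so it slots into whatever orchestration unit your team already uses.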

Scale your classification effortlessly

Confidently handle millions of requests and billions of tokens at a time. Automatically organize all your data, from product reviews to CRM entries, without the pain of managing infrastructure.

Reduce classification costs by 10x or more

Get results faster and slash your budget. Sutro's batch processing architecture dramatically reduces the cost of running large-scale LLM classification jobs.

From raw text to insights in minutes

Shorten development cycles by getting feedback from large batch jobs in as little as minutes. Go from idea to millions of classified records without waiting days for results.

Sentiment analysis

Gauge how users feel by classifying sentiment across product reviews, support tickets, and survey responses at scale.

Document tagging

Automatically apply relevant tags to large volumes of documents for improved organization and search.

Structured Extraction

Transform unstructured data into structured insights that drive business decisions.

Customer review analysis

Easily sift through thousands of product reviews to unlock valuable product insights.

Lead scoring

Analyze and score incoming leads to prioritize your sales team's efforts.

RAG data preparation

Prepare and enrich large corpuses of text for improved retrieval-augmented generation performance.
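Structured extraction, like the other use cases above, follows the same pattern as the classification example: describe the fields you want with a Pydantic model and pass it as output_schema. The SupportTicket schema and input file below are hypothetical illustrations; only the so.infer call itself comes from the example above.

import sutro as so
from pydantic import BaseModel

# Hypothetical extraction schema: typed fields pulled from free-form ticket text.
class SupportTicket(BaseModel):
    product: str
    issue_category: str
    urgency: str

system_prompt = 'Extract the product, issue category, and urgency from this support ticket.'
results = so.infer('support_tickets.csv', system_prompt, output_schema=SupportTicket)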


FAQ

What is Sutro?

What can I do with Sutro?

How does Sutro help reduce costs?

Can I integrate Sutro with my existing tools?

What kind of use cases is Sutro good for?

What Will You Scale with Sutro?