Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating ...


Using the OpenAI Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
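The demo above reflects a typical batch workload: generating synthetic question/answer pairs from chunks of documentation. As a rough sketch of how that job could be expressed with Sutro's Python SDK, assuming the same so.infer(inputs, system_prompt, output_schema=...) pattern used in the example further down this page (the QAPair schema and doc_chunks list are illustrative, not part of a documented API):

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one generated question/answer pair
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative inputs; in practice, every chunk of the vLLM documentation
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# Assumed to mirror the review-classification example shown below
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)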

Sales call analysis

Analyze Millions of Sales Calls at a Fraction of the Cost

Transform your entire backlog of sales call transcripts into structured, actionable insights. Run LLM batch jobs in hours, not days, to understand customer needs, identify winning patterns, and improve team performance.


From Raw Transcripts to Revenue Insights, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs to unlock insights from your sales calls.

import sutro as so
from pydantic import BaseModel

# Structured output schema applied to every review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data (e.g. User_reviews.csv, User_reviews-1.csv, User_reviews-2.csv, User_reviews-3.csv)
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Run the batch job over the full dataset
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41M | Tokens generated: 591k

Rapidly Prototype Your Analysis

Start small and iterate fast on your analysis workflows. Shorten development cycles by getting feedback from a small batch of calls in minutes before scaling up.
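As a rough illustration of that prototyping loop, reusing the so.infer pattern from the snippet above (the CallSummary schema and the sample transcripts are illustrative assumptions, not a documented API):

import sutro as so
from pydantic import BaseModel

# Hypothetical prototype schema: keep it small while iterating on the prompt
class CallSummary(BaseModel):
    summary: str
    sentiment: str

# Illustrative transcripts; in practice these would be loaded from your
# warehouse or transcript store
all_transcripts = [
    "Rep: Thanks for joining today...",
    "Customer: We're currently comparing a few vendors...",
]

# Take a small slice for fast feedback before committing to the full backlog
sample = all_transcripts[:100]

system_prompt = 'Summarize this sales call and classify the overall customer sentiment.'

results = so.infer(sample, system_prompt, output_schema=CallSummary)

Once the prompt and schema look right on the sample, the same call can be pointed at the full transcript set.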

Scale Your Analysis

Scale your LLM workflows to analyze millions of calls. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate with Your Stack

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
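For instance, a minimal sketch of wiring a Sutro batch job into an Airflow DAG using the TaskFlow API (the DAG name, schedule, and load_transcripts helper are illustrative assumptions; only the so.infer call mirrors the snippet above):

from datetime import datetime

from airflow.decorators import dag, task
from pydantic import BaseModel
import sutro as so

class CallSummary(BaseModel):
    summary: str
    sentiment: str

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def sales_call_analysis():
    @task
    def load_transcripts() -> list[str]:
        # Illustrative: pull the latest transcripts from your warehouse or object store
        return ["Rep: Thanks for joining today...", "Customer: We're comparing vendors..."]

    @task
    def analyze(transcripts: list[str]) -> None:
        system_prompt = 'Summarize this sales call and classify the overall customer sentiment.'
        results = so.infer(transcripts, system_prompt, output_schema=CallSummary)
        # From here, write the structured results to your CRM or warehouse

    analyze(load_transcripts())

sales_call_analysis()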

Understand Your Entire Customer Base

Stop relying on small samples. Process millions of requests to get a complete picture of customer objections, feature requests, and competitor mentions from your entire call history.

Get Insights Faster, For Less

Don't wait days for analysis. Parallelize your LLM calls to process billions of tokens in hours and reduce costs by 10x or more, freeing up your budget and your team's time.

Arm Your Sales Team with Data

Automatically extract key topics, sentiment, and action items from every call. Transform unstructured call transcripts into structured data to enrich your CRM and empower your reps.
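A minimal sketch of what that extraction could look like, following the same output_schema pattern as the snippet above (the CallInsights fields and the CRM write-back step are illustrative assumptions):

import sutro as so
from pydantic import BaseModel

# Hypothetical structured-output schema for CRM enrichment
class CallInsights(BaseModel):
    key_topics: list[str]
    sentiment: str
    action_items: list[str]
    competitor_mentions: list[str]

transcripts = [
    "Rep: Thanks for joining today...",
    "Customer: We're currently comparing a few vendors...",
]

system_prompt = (
    'Extract the key topics, overall customer sentiment, action items, '
    'and any competitor mentions from this sales call transcript.'
)

results = so.infer(transcripts, system_prompt, output_schema=CallInsights)

# Each structured result can then be attached to the matching account record in the CRM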

Personalized email generation

Generate tailored outreach emails for every lead, informed by your own customer and call data.

Lead scoring

Enrich your data with meaningful labels to improve data preparation.

Customer review analysis

Easily sift through thousands of product reviews and unlock valuable product insights.

Structured Extraction

Transform unstructured data into structured insights that drive business decisions.

Sentiment analysis

Automatically organize your data into meaningful categories without involving your ML engineer.

Conversation summarization

Convert large corpuses of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

FAQ

What can I do with Sutro?

How does Sutro help reduce costs?

How quickly can I process jobs with Sutro?

How do I integrate Sutro into my existing tools?

What are some common use cases for Sutro?

What Will You Scale with Sutro?