Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the Open AI Server

Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the Open AI Server

Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Conversation summarization

Summarize Millions of Conversations to Unlock Key Insights

Transform massive amounts of free-form text from meeting notes, product reviews, and support calls into analytics-ready summaries. Run LLM batch jobs in hours, not days, at a fraction of the cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the Open AI Server

Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the Open AI Server

Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the Open AI Server

Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Further reading can be found in Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

From Raw Conversation to Actionable Summary

Sutro takes the pain away from testing and scaling LLM batch jobs with a simple, Pythonic workflow to unblock your most ambitious AI projects.

import sutro as so

from pydantic import BaseModel

class ReviewClassifier(BaseModel):

sentiment: str

user_reviews = '.

User_reviews.csv

User_reviews-1.csv

User_reviews-2.csv

User_reviews-3.csv

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your summarization prompts. Accelerate experiments by testing on a sample of conversations before committing to large jobs.

Scale

Scale your summarization workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.

Scale Effortlessly

Confidently handle millions of conversation summaries, and billions of tokens at a time without the pain of managing infrastructure.

Reduce Costs by 10x or More

Reduce Costs by 10x or More

Reduce Costs by 10x or More

Get summarized results faster and reduce costs by parallelizing your LLM calls through Sutro.

Unlock Insights Faster

Shorten development cycles by getting feedback from large batch summarization jobs in as little as minutes before scaling up to your entire dataset.

Sales call analysis

Longer description goes here, should span multiple lines.

Product insight mining

Sift through thousands of product reviews to unlock valuable product insights while brewing your morning coffee.

Customer review analysis

Process and categorize customer feedback to understand sentiment and guide product development.

Document summarization

Condense large documents into concise abstracts for easier review and analysis.

Sentiment analysis

Automatically organize your conversational data into meaningful categories like positive, neutral, or negative.

Performance review summarization

Distill lengthy performance reviews into key takeaways for HR and management.

Sales call analysis

Longer description goes here, should span multiple lines.

Product insight mining

Sift through thousands of product reviews to unlock valuable product insights while brewing your morning coffee.

Customer review analysis

Process and categorize customer feedback to understand sentiment and guide product development.

Document summarization

Condense large documents into concise abstracts for easier review and analysis.

Sentiment analysis

Automatically organize your conversational data into meaningful categories like positive, neutral, or negative.

Performance review summarization

Distill lengthy performance reviews into key takeaways for HR and management.

Sales call analysis

Longer description goes here, should span multiple lines.

Product insight mining

Sift through thousands of product reviews to unlock valuable product insights while brewing your morning coffee.

Customer review analysis

Process and categorize customer feedback to understand sentiment and guide product development.

Document summarization

Condense large documents into concise abstracts for easier review and analysis.

Sentiment analysis

Automatically organize your conversational data into meaningful categories like positive, neutral, or negative.

Performance review summarization

Distill lengthy performance reviews into key takeaways for HR and management.

FAQ

What is Sutro?

What tasks can I perform with Sutro?

How does Sutro reduce costs?

Can I integrate Sutro with my current data tools?

How does Sutro help me scale my summarization tasks?

What Will You Scale with Sutro?