Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating...

Using the OpenAI Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
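
As a rough illustration, a job like the one above might be expressed with Sutro's Python SDK (a sketch: the doc_chunks list and QAPair schema are illustrative, and it assumes so.infer also accepts an in-memory list of inputs in addition to the file path used in the example further down this page):

import sutro as so
from pydantic import BaseModel

# Illustrative output schema for one generated question/answer pair.
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical list of vLLM documentation chunks to process in one batch.
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)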

Performance review summarization

Unlock Insights from Thousands of Performance Reviews in Minutes

Effortlessly process and summarize entire cycles of employee performance reviews at once. Sutro transforms unstructured text into actionable insights, saving you time and reducing costs by processing requests in a single batch job.

From Raw Reviews to Actionable Summaries, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs. Simply connect your data, define your task, and get results without complex infrastructure.

import sutro as so
from pydantic import BaseModel

# Output schema for each review: a single sentiment label.
class ReviewClassifier(BaseModel):
    sentiment: str

# CSV of reviews to classify in a single batch job.
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype

Start small and iterate fast on your summarization prompts and workflows. Accelerate experiments by testing on a small batch of reviews before committing to the full job.
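
For instance, a prototype run might look something like this (a sketch: the sample file name, row count, and prompt are illustrative, and only the so.infer call with a file path follows the example above):

import sutro as so
import pandas as pd

# Illustrative: write a small sample of reviews to its own file so the
# prompt can be tuned cheaply before committing to the full batch.
pd.read_csv('User_reviews.csv').head(100).to_csv('User_reviews_sample.csv', index=False)

system_prompt = 'Summarize the key strengths and growth areas in this performance review.'
sample_results = so.infer('User_reviews_sample.csv', system_prompt)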

Scale

Scale your summarization workflow to process thousands of reviews in hours. Do more in less time with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing HR systems and data workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
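
As a rough sketch of what that integration could look like using Airflow's TaskFlow API (the DAG structure, schedule, and task names are illustrative; only the so.infer call comes from the example above):

import sutro as so
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule='@monthly', start_date=datetime(2024, 1, 1), catchup=False)
def review_summarization():
    # Illustrative task wrapping the same Sutro batch call shown above.
    @task
    def summarize_reviews():
        system_prompt = 'Summarize the key strengths and growth areas in this performance review.'
        results = so.infer('User_reviews.csv', system_prompt)
        # Hand results off to downstream tasks or storage as needed.

    summarize_reviews()

review_summarization()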

Gain Insights at Scale

Confidently process thousands of performance reviews at a time. Scale your HR analytics without the pain of managing infrastructure or worrying about rate limits.

Reduce Costs by 10x or More

Get your summarization results faster and significantly reduce costs by parallelizing LLM calls through Sutro's batch processing API.

Accelerate Your Review Cycles

Shorten feedback loops by getting insights from large batches of performance reviews in hours, not days, freeing up your team to focus on strategic initiatives.

Resume Screening

Screen and summarize large volumes of resumes against role requirements at scale.

Sentiment Analysis

Gauge the sentiment of employee feedback, surveys, and reviews at scale.

Conversation Summarization

Distill key takeaways from meeting notes and internal communications.

Job Description Parsing

Extract key requirements and responsibilities from job descriptions for better analysis.

Document Tagging

Automatically organize your HR documents into meaningful categories without involving an ML engineer.

Structured Extraction

Transform unstructured HR documents into structured insights that drive decisions.
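
As a sketch of what structured extraction could look like with the same so.infer pattern shown above (the ReviewInsights schema and its fields are illustrative, not a Sutro-defined type):

import sutro as so
from pydantic import BaseModel

# Illustrative schema for structured insights pulled from each review.
class ReviewInsights(BaseModel):
    employee_role: str
    strengths: list[str]
    growth_areas: list[str]
    overall_sentiment: str

system_prompt = 'Extract the role, strengths, growth areas, and overall sentiment from this performance review.'
results = so.infer('User_reviews.csv', system_prompt, output_schema=ReviewInsights)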

FAQ

What is Sutro?

What kinds of tasks can I perform with Sutro?

How does Sutro reduce costs?

Can I use Sutro with my existing data tools?

Do I need to manage my own infrastructure?

What Will You Scale with Sutro?