Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the OpenAI Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Audio transcription

Transcribe Audio in Bulk, From Hours to Minutes

Run massive audio transcription jobs in hours, not days, at a fraction of the cost. Convert your audio files into structured, analytics-ready text without the pain of managing infrastructure.

From Idea to Millions of Transcripts, Simplified

Sutro takes the pain away from testing and scaling LLM batch transcription jobs to unblock your most ambitious AI projects.

import sutro as so
from pydantic import BaseModel

# Schema for the structured output returned for each review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file of reviews (the demo also lists User_reviews-1.csv, User_reviews-2.csv, and User_reviews-3.csv)
user_reviews = 'User_reviews.csv'
system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
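Once a classification job like this finishes, the structured outputs still need to be checked and aggregated. The following is a minimal sketch using only the Python standard library. It assumes, purely for illustration, that each result arrives as a JSON string with a `sentiment` field. That row shape, the `tally_sentiments` helper, and the sample rows are hypothetical and not part of Sutro's documented API.

```python
import json
from collections import Counter

# Labels named in the system prompt above.
ALLOWED = {"positive", "neutral", "negative"}

def tally_sentiments(raw_rows):
    """Count valid sentiment labels and collect indices of rows that fail validation."""
    counts = Counter()
    rejected = []
    for i, raw in enumerate(raw_rows):
        try:
            # Normalize case and whitespace before checking the label.
            label = json.loads(raw)["sentiment"].strip().lower()
        except (json.JSONDecodeError, KeyError, AttributeError, TypeError):
            rejected.append(i)
            continue
        if label in ALLOWED:
            counts[label] += 1
        else:
            rejected.append(i)
    return counts, rejected

# Hypothetical sample rows standing in for real job output.
sample = [
    '{"sentiment": "positive"}',
    '{"sentiment": "Negative"}',
    '{"sentiment": "mixed"}',
    'not json',
]
counts, rejected = tally_sentiments(sample)
```

Keeping the rejected indices separate makes it easy to inspect or re-run only the rows whose output did not match the expected labels.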

Rapidly Prototype

Start small and iterate fast on your transcription workflows. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale

Scale your transcription workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, object storage, and open data formats.

Scale your transcription workflows effortlessly

Confidently handle millions of audio files at a time. Sutro removes the complexity of managing infrastructure so you can focus on your data.

Reduce transcription costs by 10x or more

Get transcriptions faster and significantly reduce costs by parallelizing your LLM calls through Sutro's purpose-built batch processing.

Go from audio files to insights in minutes

Shorten development cycles by getting feedback from large batch jobs in minutes before scaling up your transcription projects.

Conversation summarization

Sentiment analysis

Analyze the sentiment of transcribed customer calls, interviews, or feedback to understand opinions at scale.

Sales call analysis

Extract insights from transcribed sales calls to improve coaching, identify winning patterns, and enrich your CRM.

Entity extraction

Automatically identify and extract key entities like names, companies, and locations from your transcribed text.

Unlock Product Insights

Easily sift through thousands of transcribed product reviews and unlock valuable product insights.

RAG data preparation

Easily convert large corpuses of transcribed text into vector representations for semantic search and retrieval.

FAQ

What is Sutro?

How does Sutro save costs?

What kind of workflows can I run on Sutro?

How do I integrate Sutro into my existing systems?

Can I test my workflows before running a large job?

What Will You Scale with Sutro?