Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
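
The hero example above is a batch question/answer generation job over chunks of the vLLM documentation. A minimal sketch of how that job might look with Sutro's Python SDK is below, reusing the so.infer(inputs, system_prompt, output_schema=...) call from the sentiment example later on this page; the QAPair schema and the input file name are illustrative assumptions, not Sutro's documented API.

import sutro as so
from pydantic import BaseModel

# Illustrative structured-output schema: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative input file containing the scraped documentation chunks shown under Inputs
doc_chunks = 'vllm_doc_chunks.csv'

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# Mirrors the so.infer call shown in the sentiment example below
qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)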

Sentiment analysis

Unlock product insights from millions of reviews in minutes

Easily sift through thousands of product reviews, support tickets, and social media posts to unlock valuable product insights. Sutro transforms massive amounts of free-form text into analytics-ready datasets without the pain of managing your own infrastructure.

From Raw Text to Actionable Insight, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs. Go from idea to millions of requests for your most ambitious AI projects.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input dataset of free-form user reviews; the demo cycles through
# User_reviews.csv, User_reviews-1.csv, User_reviews-2.csv, and User_reviews-3.csv
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype

Start small and iterate fast on your sentiment analysis workflow. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale

Scale your LLM workflows so your team can do more in less time. Process billions of tokens from user reviews in hours, not days, with no infrastructure headaches.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, object storage, and notebooks.
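
As a rough sketch of what that integration can look like, the snippet below wraps the same so.infer call from the example above in a plain Python function that a notebook cell or an orchestration task (Airflow, Dagster, Prefect, etc.) could invoke; the function name and file path are illustrative.

import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

def classify_reviews(reviews_file: str):
    # reviews_file points at a CSV of free-form reviews, as in the example above
    system_prompt = 'Classify the review as positive, neutral, or negative.'
    return so.infer(reviews_file, system_prompt, output_schema=ReviewClassifier)

# e.g. call this from a scheduled pipeline step or a notebook
results = classify_reviews('User_reviews.csv')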

Analyze sentiment at a fraction of the cost

Get results faster and reduce costs by 10x or more. Sutro parallelizes your LLM calls, running large batch jobs for a fraction of the cost of traditional methods.

Scale from thousands of reviews to millions

Confidently handle millions of requests and billions of tokens at a time. Process your entire corpus of customer feedback without the pain of managing infrastructure or worrying about scale.

Get feedback from batch jobs in minutes

Shorten development cycles by rapidly prototyping your sentiment analysis workflow. Get feedback from large batch jobs in as little as minutes before committing to a full-scale run.
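
One way to structure that prototype-then-scale loop, as a sketch: sample a small slice of the dataset, run it through the same so.infer call shown above, inspect the results, and only then launch the full job. The pandas sampling step and file names are illustrative assumptions.

import pandas as pd
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Prototype: write a ~1,000-row sample to its own file and run the job on that first
pd.read_csv('User_reviews.csv').sample(1000, random_state=0).to_csv('User_reviews_sample.csv', index=False)
sample_results = so.infer('User_reviews_sample.csv', system_prompt, output_schema=ReviewClassifier)

# Once the prompt and schema look right, commit to the full-scale run
results = so.infer('User_reviews.csv', system_prompt, output_schema=ReviewClassifier)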

Product insight mining

Surface valuable product insights from product reviews, support tickets, and social media posts at scale.

Text classification

Automatically organize your data into meaningful categories without involving your ML engineer.

Customer review analysis

Convert massive amounts of free-form text from customer feedback into analytics-ready datasets.

Structured Extraction

Transform unstructured data into structured insights that drive business decisions, as sketched below.

LLM performance evaluation

Benchmark your LLM outputs to continuously improve workflows, agents and assistants, or easily evaluate custom models.

Embedding Generation

Easily convert large corpuses of free-form text into vector representations for semantic search and recommendations.
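
The Structured Extraction use case above follows the same pattern as the sentiment example: define a richer Pydantic schema and pass it as output_schema. A minimal sketch, reusing the so.infer call shown earlier; the schema fields and input file name are illustrative assumptions.

import sutro as so
from pydantic import BaseModel

# Illustrative extraction schema: the fields you want pulled out of each piece of feedback
class FeedbackRecord(BaseModel):
    product: str
    issue_category: str
    severity: str
    requested_feature: str

system_prompt = (
    'Extract the product, issue category, severity, and any requested feature '
    'from the following piece of customer feedback.'
)

# Mirrors the so.infer call shown earlier; customer_feedback.csv is a placeholder
records = so.infer('customer_feedback.csv', system_prompt, output_schema=FeedbackRecord)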

FAQ

What is Sutro for?

How does Sutro reduce costs?

Do I need to manage any infrastructure?

How does Sutro integrate with my existing tools?

What kinds of use cases does Sutro support?

What Will You Scale with Sutro?