Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating...

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
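
As a rough sketch of how this kind of question/answer generation job could be expressed with Sutro's Python SDK (the QAPair schema, the doc_chunks list, and the prompt wording are illustrative assumptions; only the so.infer call mirrors the example further down this page):

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one generated question/answer pair
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative documentation chunks; in practice this could be thousands of
# chunks split from the vLLM docs
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = ("Generate a question/answer pair for the following chunk "
                 "of vLLM documentation.")

# Same so.infer pattern as the review-classification example below
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)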

Product description enrichment

Enrich Your Entire Product Catalog for a Fraction of the Cost

Automatically enhance your messy product catalog data and enrich your CRM entries. Sutro makes it simple to process millions of products in a single batch job, improving data quality and driving business decisions.

From Basic Data to Enriched Catalog, Simplified

Sutro takes the pain out of testing and scaling LLM batch jobs. Our Python-native workflow lets you start small, test your logic, and scale to millions of items with ease.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file of reviews to classify
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype Your Prompt

Start small and iterate fast on your LLM batch workflows. Accelerate experiments by testing on a small set of products before committing to large jobs.

Scale Your Workflow

Scale your LLM workflows to do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate with Your Stack

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
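
A minimal sketch of that integration, assuming Airflow 2.x's TaskFlow API and the same so.infer call shown in the example above; the DAG and task names are illustrative, not part of Sutro's documented API:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def enrich_product_catalog():
    @task
    def classify_reviews():
        import sutro as so
        from pydantic import BaseModel

        class ReviewClassifier(BaseModel):
            sentiment: str

        # Same call as in the example above, run as one step of a larger pipeline
        return so.infer(
            'user_reviews.csv',
            'Classify the review as positive, neutral, or negative.',
            output_schema=ReviewClassifier,
        )

    classify_reviews()

enrich_product_catalog()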

Scale to Millions of SKUs Effortlessly

Confidently handle millions of product requests and billions of tokens at a time. Enrich your entire catalog without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get results faster and reduce costs by parallelizing your LLM calls through Sutro. Transform your product data at a fraction of the cost of other methods.

Go from Raw Data to Enriched Catalog in Hours

Shorten development cycles by getting feedback from large batch jobs in hours, not days. Run LLM batch jobs to enrich your product data while you brew your morning coffee.

Product description generation

Automatically generate rich, consistent product descriptions for every item in your catalog.

Product insight mining

Easily sift through thousands of product reviews and unlock valuable product insights.

Structured Extraction

Transform unstructured data from product descriptions or reviews into structured insights that drive business decisions (see the sketch after this list).

Customer review analysis

Analyze sentiment and extract key topics from thousands of customer reviews to understand user feedback.

Personalized email generation

Tailor your marketing and advertising efforts to thousands of individuals, personas, and demographics.

Embedding Generation

Easily convert large corpuses of product text into vector representations for semantic search and recommendations.
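
A minimal sketch of the Structured Extraction use case above, assuming the same so.infer call as the earlier example; the ProductInsights schema and its fields are illustrative assumptions, not a documented API:

import sutro as so
from pydantic import BaseModel
from typing import List

# Hypothetical schema capturing structured insights from raw review text
class ProductInsights(BaseModel):
    sentiment: str
    key_topics: List[str]
    mentioned_defects: List[str]

system_prompt = ("Extract the overall sentiment, key topics, and any "
                 "mentioned defects from the following product review.")

results = so.infer('user_reviews.csv', system_prompt, output_schema=ProductInsights)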

FAQ

What is Sutro?

How does Sutro reduce costs?

What are Sutro's main capabilities?

How do I use Sutro?

What kind of scale can Sutro handle?

What Will You Scale with Sutro?