Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
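A minimal sketch of what a job like the one above could look like with Sutro's Python SDK, reusing the so.infer call pattern shown in the workflow example further down this page; the QAPair schema and the documentation chunks are illustrative assumptions, not a documented API.

import sutro as so
from pydantic import BaseModel

# Illustrative schema for one generated question/answer pair
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative inputs: chunks of vLLM documentation as plain strings
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving.",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer.",
]

system_prompt = "Generate a question/answer pair for the following chunk of vLLM documentation."

# Same call pattern as the review-classification example below
qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)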

Product description generation

Generate a million product descriptions in hours, not days

Create high-quality, on-brand descriptions for your entire product catalog at a fraction of the cost. Sutro takes the pain out of scaling LLM batch jobs to unblock your most ambitious e-commerce projects.

From Idea to Full Catalog, Simplified

Sutro streamlines the entire process of generating product descriptions at scale with a simple, Python-native workflow.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Reviews to classify (additional files such as user_reviews-1.csv,
# user_reviews-2.csv, and user_reviews-3.csv can be run the same way)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41M | Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your product description prompts. Accelerate experiments by testing on a sample of your catalog before committing to a large job.
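As a rough sketch of that prototype-then-scale loop, assuming the catalog has been exported to a CSV and that so.infer accepts a list of strings as in the example above; the file name, column name, sample size, and prompt are illustrative.

import random

import pandas as pd
import sutro as so
from pydantic import BaseModel

# Illustrative output schema for a single product description
class ProductDescription(BaseModel):
    description: str

system_prompt = 'Write a concise, on-brand description for the following product data.'

# Load the full catalog, then prototype on a small random sample
# ('catalog.csv' and its 'product_data' column are illustrative names)
catalog = pd.read_csv('catalog.csv')['product_data'].tolist()
sample = random.sample(catalog, 500)

preview = so.infer(sample, system_prompt, output_schema=ProductDescription)

# Once the prompt and style look right, run the full catalog
results = so.infer(catalog, system_prompt, output_schema=ProductDescription)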

Scale

Scale your LLM workflows so your team can do more in less time. Process billions of tokens to generate descriptions for every product in hours, not days, with no infrastructure headaches.

Integrate

Seamlessly connect Sutro to your existing e-commerce and data workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
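As one hedged illustration of that kind of integration, the sketch below wraps a Sutro batch job in a minimal Airflow DAG; the DAG id, schedule, product data, and prompt are illustrative assumptions, and only the so.infer call pattern comes from the example above.

from datetime import datetime

import sutro as so
from airflow.decorators import dag, task
from pydantic import BaseModel

# Illustrative output schema for a generated product description
class ProductDescription(BaseModel):
    description: str

@dag(dag_id="nightly_product_descriptions", schedule="@daily",
     start_date=datetime(2024, 1, 1), catchup=False)
def product_description_pipeline():
    @task
    def generate_descriptions():
        # Illustrative inputs; in practice these would come from an upstream task
        products = [
            "SKU-1001: waterproof hiking boot, sizes 6-13",
            "SKU-1002: insulated steel water bottle, 750 ml",
        ]
        system_prompt = "Write an on-brand product description for the following product data."
        so.infer(products, system_prompt, output_schema=ProductDescription)

    generate_descriptions()

product_description_pipeline()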

Reduce Costs by 10x or More

Get results faster and significantly lower your expenses. Sutro parallelizes your LLM calls to generate product descriptions for your entire catalog at a fraction of the cost of traditional methods.

Scale Effortlessly

Confidently handle millions of product SKUs and billions of tokens at a time. Go from a handful of descriptions to your full catalog without the pain of managing complex infrastructure.

Rapidly Prototype and Iterate

Shorten development cycles by testing different prompts and styles. Get feedback from large batch jobs in as little as minutes before committing to generating descriptions for every product.

Product description enrichment

Enrich thin or outdated product listings with complete, accurate, on-brand descriptions across your entire catalog.

Customer review analysis

Easily sift through thousands of product reviews and unlock valuable product insights to inform your next steps.

Structured Extraction

Transform unstructured data from manufacturer spec sheets or web pages into structured insights that drive business decisions.

Content personalization

Tailor your marketing and advertising efforts to thousands of individuals, personas, and demographics to increase response rates.

Embedding Generation

Easily convert large corpuses of product information into vector representations for semantic search and recommendations.

Improve Model Performance

Improve your LLM or RAG retrieval performance with synthetic data. Generate diverse responses to fill statistical gaps.

FAQ

What is Sutro?

How does Sutro reduce costs?

Can I use Sutro with my existing data tools?

How does Sutro handle millions of requests?

What are the primary functions of Sutro?

What Will You Scale with Sutro?