Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating...

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
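
As a rough sketch, a job like this could be expressed with Sutro's Python SDK using the so.infer call shown further down this page. The QAPair schema, the in-memory list of chunks, and the assumption that so.infer accepts a list of strings are illustrative, not the exact demo code.

import sutro as so
from pydantic import BaseModel

# Illustrative output schema: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# Assumed input: vLLM documentation split into chunks, one string per chunk
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = "Generate a question/answer pair for the following chunk of vLLM documentation."

# One batched call produces a QAPair for every chunk
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)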

Content Translation

Translate Content at Scale for a Global Audience

Run LLM batch jobs to translate millions of web pages, product descriptions, or user reviews in hours, not days, at a fraction of the cost. Sutro takes the pain away from scaling your most ambitious translation projects.
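
As a minimal sketch, a batch translation job could look like the so.infer call shown further down this page. The Translation schema, the target language, and the input file name are illustrative assumptions.

import sutro as so
from pydantic import BaseModel

# Illustrative output schema for a translated record
class Translation(BaseModel):
    translated_text: str

# Assumed input file: one product description per row
product_descriptions = 'product_descriptions.csv'

system_prompt = 'Translate the following product description into Spanish.'

# One batched call translates every row in the file
results = so.infer(product_descriptions, system_prompt, output_schema=Translation)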


From Local to Global, Simplified

Sutro simplifies every step of your bulk translation workflow, from initial testing to full-scale deployment.

import sutro as so
from pydantic import BaseModel

# Structured output: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file of user reviews; the demo also lists user_reviews-1.csv,
# user_reviews-2.csv, and user_reviews-3.csv as additional inputs
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Rapidly Prototype

Start small and iterate fast on your translation prompts. Accelerate experiments by testing on a sample of your content with Sutro before committing to large jobs.
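
One way to prototype, sketched with the same so.infer call used above; the pandas sampling step and file names are illustrative assumptions.

import pandas as pd
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

# Illustrative: test the prompt on a few hundred rows before the full 500k-row job
reviews = pd.read_csv('user_reviews.csv')
reviews.sample(n=500, random_state=0).to_csv('user_reviews_sample.csv', index=False)

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Same call as the full job, just pointed at the sample file
results = so.infer('user_reviews_sample.csv', system_prompt, output_schema=ReviewClassifier)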

Scale Effortlessly

Scale your LLM translation workflows to do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate Seamlessly

Connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
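
As one example, a Sutro job can sit inside a Dagster asset. This is a minimal sketch: the asset name and input file are illustrative, and the so.infer call is the one shown above.

from dagster import asset
from pydantic import BaseModel
import sutro as so

class ReviewClassifier(BaseModel):
    sentiment: str

@asset
def review_sentiment():
    # Batch-classify reviews as one step in an orchestrated data pipeline
    system_prompt = 'Classify the review as positive, neutral, or negative.'
    return so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)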

Translate Your Entire Content Library

Confidently handle millions of translation requests and billions of tokens at a time. Go global by translating everything from your website to your product catalog without the pain of managing infrastructure.

Reduce Translation Costs by 10x or More

Get results faster and significantly reduce costs. Sutro parallelizes your LLM calls to process massive amounts of text efficiently, making large-scale translation economically viable.

Launch in New Markets Faster

Shorten development cycles by getting feedback from large translation jobs in minutes. Accelerate your global expansion by testing and scaling your translation workflows at unprecedented speed.

Content localization

Adapt your content for new markets by translating and localizing everything from web pages to product catalogs.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets for your company or your customers.

Personalized email generation

Tailor your marketing efforts to thousands, or millions of individuals, to dramatically increase response rates and ad conversions.

Product description enrichment

Improve your messy product catalog data by enriching descriptions with new details, specifications, or marketing copy.

Sentiment analysis

Automatically organize your data into meaningful categories, like analyzing the sentiment of thousands of product reviews.

Structured Extraction

Transform unstructured data into structured insights that drive business decisions, pulling key information from large text corpuses.


FAQ

What is Sutro?

How does Sutro help me save money?

Can I test my workflow before running a large job?

What kind of scale can Sutro handle?

Does Sutro work with my existing tools?

What Will You Scale with Sutro?