Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Prompt: Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass an already-loaded model into the vLLM framework for further processing and inference, without reloading it from disk or a model hub, start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM's Run:ai optional dependency:
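
A typical install and launch looks like the commands below (this assumes the runai extra and the --load-format runai_streamer flag described in the vLLM documentation; check the current docs for your vLLM version):

pip install vllm[runai]
vllm serve meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer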

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
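
A job like the one above fits in a few lines of Sutro's Python SDK. The sketch below is illustrative rather than a verbatim recipe: it assumes the so.infer call shown in the example further down this page, and the QAPair schema and doc_chunks.csv file name are hypothetical.

import sutro as so
from pydantic import BaseModel

# Desired structured output for each documentation chunk (hypothetical schema)
class QAPair(BaseModel):
    question: str
    answer: str

# One documentation chunk per row (hypothetical input file)
doc_chunks = 'doc_chunks.csv'

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# Run the prompt over every chunk as a single batch job
qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)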

Content localization

Localize Content for Global Audiences in Hours, Not Days

Adapt your marketing, product, and web content for global audiences at a fraction of the cost. Sutro takes the pain away from testing and scaling your most ambitious AI projects.

From Idea to Millions of Requests, Simplified

Sutro simplifies every step of your bulk localization workflow, from initial prompt testing to processing millions of pages.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data: a CSV of free-form reviews
# (the demo cycles through user_reviews.csv, user_reviews-1.csv, user_reviews-2.csv, user_reviews-3.csv)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Run the prompt over every review as a single batch job
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

# Sample progress output:
# Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype and Iterate

Start small and iterate fast on your localization workflows. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale with Confidence

Scale your LLM workflows to process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate with Your Stack

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
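
As a sketch of what that integration can look like, the snippet below wires a Sutro batch job into an Airflow DAG using standard TaskFlow decorators. The so.infer call mirrors the example above; the DAG name, schedule, prompt, and input data are illustrative assumptions, not a prescribed setup.

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def nightly_localization():

    @task
    def localize_content():
        import sutro as so
        # Illustrative placeholder strings; in practice these would come from an upstream task
        source_strings = ['Welcome back!', 'Your order has shipped.']
        system_prompt = 'Translate the following UI string into French.'
        return so.infer(source_strings, system_prompt)

    localize_content()

nightly_localization()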

Reach Global Markets Faster

Run large-scale localization jobs in hours, not days. Confidently handle millions of requests to tailor content for new markets without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get results faster and reduce costs by parallelizing your LLM calls. Convert massive amounts of free-form text for different locales without exploding costs.

Scale Your Efforts, Not Your Headaches

Scale your LLM workflows so your team can do more in less time. Process billions of tokens at a time to support any number of languages or regions, with no infrastructure to manage.

Content Translation

Translate marketing, product, and web content into any number of languages at a fraction of the cost.

Content personalization

Tailor your marketing and advertising efforts to thousands, or millions of individuals, personas, and demographics.

Bulk content generation

Improve your LLM performance with synthetic data. Generate diverse and representative responses to fill statistical gaps.

Website data extraction

Crawl millions of web pages, and extract analytics-ready datasets for your company or your customers.

Unstructured ETL

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

Personalized email generation

Dramatically increase response rates and ad conversions by tailoring marketing efforts to millions of individuals.

FAQ

What is Sutro?

How does Sutro save money?

How does Sutro handle large jobs?

Can I integrate Sutro into my existing workflow?

What core tasks can I perform with Sutro?

What Will You Scale with Sutro?