Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating…

Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:
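The command here, per vLLM's documentation, is likely:

pip install vllm[runai]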

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
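A job like the one above can be expressed in a few lines with Sutro's Python SDK, reusing the so.infer pattern from the classification example further down this page; the QAPair schema and the single-chunk list are illustrative:

import sutro as so
from pydantic import BaseModel

# Structured output: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# One chunk from the inputs above; a real job would pass thousands
chunks = ['vLLM is a fast and easy-to-use library for LLM inference and serving.']

prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation'

results = so.infer(chunks, prompt, output_schema=QAPair)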

Personalized email generation

Personalize Outreach for Millions at a Fraction of the Cost

Run LLM batch jobs to tailor your marketing and advertising efforts to thousands, or millions, of individuals, personas, and demographics. Dramatically increase response rates and ad conversions without the pain of managing infrastructure.

From Idea to Personalized Campaign, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs. Seamlessly connect to your existing workflows to generate personalized emails at scale.

import sutro as so
from pydantic import BaseModel

# Structured output: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Reviews to classify; the demo cycles through User_reviews.csv and its variants
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype Your Messaging

Start small and iterate fast on your email copy. Accelerate experiments by testing on Sutro before committing to a large job.
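A minimal sketch of that loop, assuming the same so.infer call as the example above; the Email schema and sample lead records are illustrative:

import sutro as so
from pydantic import BaseModel

# Structured output: one drafted email per lead
class Email(BaseModel):
    subject: str
    body: str

# Illustrative lead records; in practice this would be your full list
leads = ['Ada, CTO at a robotics startup', 'Grace, data lead at a retailer']

system_prompt = 'Write a short, personalized outreach email for this lead.'

# Iterate on the prompt against a small sample first...
drafts = so.infer(leads[:100], system_prompt, output_schema=Email)

# ...then run the full list once the copy looks right.
results = so.infer(leads, system_prompt, output_schema=Email)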

Scale to Your Entire List

Scale your LLM workflows so your team can do more in less time. Process millions of personalized emails in hours, with no infrastructure headaches.

Integrate with Your Tools

Seamlessly connect Sutro to your existing marketing and data workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
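For instance, a Dagster asset wrapping a Sutro job might look like the following sketch; the asset name, upstream leads asset, and prompt are illustrative:

import sutro as so
from dagster import asset

@asset
def email_drafts(leads):
    # leads is an upstream Dagster asset containing lead records
    system_prompt = 'Write a short, personalized outreach email for this lead.'
    return so.infer(leads, system_prompt)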

Engage millions, not thousands

Confidently handle millions of requests to generate personalized content. Scale your outreach effortlessly without worrying about infrastructure or exploding costs.

Reduce outreach costs by 10x or more

Get results faster and reduce costs by parallelizing your LLM calls through Sutro. Process your entire campaign at a fraction of the cost of traditional, one-by-one API calls.

Launch campaigns in hours, not days

Process billions of tokens to generate an entire email campaign in hours. Shorten development cycles by getting feedback from large batch jobs in minutes before scaling up.

Content Personalization

Tailor content to thousands, or millions, of individuals, personas, and demographics, and dramatically increase response rates and conversions.

Lead Scoring

Automatically organize and analyze lead data to prioritize outreach and help your sales team focus on the most promising opportunities.

Enrich Data

Improve your messy product catalog data or enrich your CRM entries without involving your machine learning engineer.

Unlock Product Insights

Easily sift through thousands of product reviews and unlock valuable product insights to inform your marketing and product strategy.

Synthetic Data Generation

Generate high-quality, diverse, and representative synthetic data to improve model or RAG retrieval performance.

Customer Review Analysis

Transform unstructured reviews into structured insights that drive business decisions and improve customer satisfaction.

FAQ

What is Sutro?

How does Sutro save money?

What kind of tasks is Sutro built for?

How do I use Sutro?

Does Sutro work with my existing tools?

What Will You Scale with Sutro?