Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

SEO meta description generation

Generate SEO Meta Descriptions For Millions of Pages, Instantly

Run LLM batch jobs to create unique, optimized meta descriptions for your entire site. Get results in hours, not days, at a fraction of the cost.


From Prompt to Published, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs for all your SEO needs.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file from the demo (other selectable inputs: User_reviews-1.csv, User_reviews-2.csv, User_reviews-3.csv)
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
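
The same pattern extends directly to SEO metadata. As a rough sketch (the MetaDescription schema, file name, and prompt below are illustrative assumptions, not part of Sutro's documented API), swapping the schema and prompt turns the job above into site-wide meta description generation:

import sutro as so
from pydantic import BaseModel

# Illustrative schema: one optimized meta description per page
class MetaDescription(BaseModel):
    meta_description: str

# Hypothetical export of your site, one row per page (URL, title, body text)
pages = 'site_pages.csv'

system_prompt = 'Write a unique, compelling meta description under 160 characters for this page.'

results = so.infer(pages, system_prompt, output_schema=MetaDescription)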

Prototype

Start small and iterate fast on your prompts. Accelerate experiments by testing on a subset of pages before committing to a large job for your whole site.
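
A minimal sketch of that workflow, assuming a pandas-readable export of your pages (the file names and sample size are illustrative):

import pandas as pd
import sutro as so

# Carve a small sample out of the full site export to iterate on the prompt cheaply
pages = pd.read_csv('site_pages.csv')
pages.sample(n=min(500, len(pages)), random_state=0).to_csv('site_pages_sample.csv', index=False)

system_prompt = 'Write a unique meta description under 160 characters for this page.'

# Same call as the full job, just pointed at the sample file
results = so.infer('site_pages_sample.csv', system_prompt)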

Scale

Scale your LLM workflows to generate millions of meta descriptions. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing content workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
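
As one illustration, a Sutro job could sit inside an Airflow DAG like this (the DAG, schedule, and file names are assumptions for the sketch; only the so.infer call follows the example above):

from airflow.decorators import dag, task
import pendulum

@dag(schedule='@weekly', start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def meta_description_refresh():
    @task
    def generate_descriptions():
        import sutro as so
        # Hypothetical export of new or changed pages produced by an upstream task
        system_prompt = 'Write a unique meta description under 160 characters for this page.'
        return so.infer('changed_pages.csv', system_prompt)

    generate_descriptions()

meta_description_refresh()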

Scale Your SEO Content Effortlessly

Confidently handle millions of pages at a time. Generate unique meta descriptions for your entire product catalog or content library without the pain of managing infrastructure.

Reduce Content Generation Costs by 10x

Get results faster and dramatically reduce costs. Sutro parallelizes your LLM calls, making large-scale SEO content generation affordable.

Rapidly Prototype Your Prompts

Shorten development cycles by testing different prompts on large batches of pages. Get feedback and refine your approach in minutes before scaling up to your entire site.
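
For instance, two prompt variants can be run against the same sample and compared side by side (the prompts and file name are illustrative; the call pattern mirrors the example above):

import sutro as so

prompts = {
    'benefit_led': 'Write a meta description under 160 characters that leads with the main user benefit.',
    'keyword_led': 'Write a meta description under 160 characters that opens with the page primary keyword.',
}

# Run each variant against the same sample file and review the outputs side by side
results = {name: so.infer('site_pages_sample.csv', prompt) for name, prompt in prompts.items()}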

Bulk content generation

Generate articles, landing pages, and category copy in bulk while keeping tone and structure consistent across your entire site.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets for competitive analysis or internal use.

Product description generation

Automatically write compelling and unique descriptions for thousands or millions of products in your catalog.

Content personalization

Tailor marketing and advertising efforts to millions of individuals by generating personalized content at scale.

Structured extraction

Transform unstructured web content or documents into structured insights that drive business decisions.

Content translation

Translate your entire website or content library into multiple languages quickly and cost-effectively.

FAQ

What is Sutro?

How does Sutro help reduce costs?

Can I test my prompts before running them on my whole site?

How does Sutro fit into my existing tech stack?

What types of tasks is Sutro built for?

What Will You Scale with Sutro?