Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
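For readers following the sample chunk above, the truncated install step it references looks roughly like the sketch below. This is drawn from vLLM's Run:ai Model Streamer documentation rather than from Sutro's API; the model path is a placeholder, and exact option names may vary between vLLM versions.

# Requires the optional dependency: pip install vllm[runai]
from vllm import LLM

# Placeholder model path; any Safetensors checkpoint reachable by the streamer works here
llm = LLM(
    model="/path/to/your/safetensors/model",
    load_format="runai_streamer",
)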

FAQ generation

Generate thousands of relevant FAQs at a fraction of the cost

Turn documentation, web pages, and product guides into comprehensive FAQs. Sutro processes millions of requests at once, allowing you to build helpful resources effortlessly and affordably in hours, not days.
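As a minimal sketch of what this looks like in code, the snippet below reuses the so.infer call pattern from Sutro's SDK example later on this page. The QAPair schema, the inline documentation chunks, and passing a plain Python list to infer are illustrative assumptions, not documented requirements.

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one FAQ entry
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative, pre-chunked documentation text (assumed to already exist)
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# One structured question/answer pair per input chunk
faqs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)

The same pattern scales from a handful of chunks to an entire corpus, as the steps below describe.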

From Documentation to FAQ in Three Steps

Sutro's Python SDK simplifies generating FAQs from your existing content. Our workflow lets you start small and scale up confidently.

import sutro as so
from pydantic import BaseModel

# Structured output: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Dataset of reviews to classify
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype

Start with a small sample of your content. Iterate fast on your prompts to find the perfect structure and tone for your FAQs before committing to large jobs; a sketch of this prototype-then-scale flow follows these steps.

Scale

Scale your workflow to process millions of web pages or your entire document corpus. Process billions of tokens in hours, not days, with no infrastructure headaches.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Our Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
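Here is a minimal sketch of the prototype-then-scale flow described above, assuming documentation chunks stored as one text file each; the directory layout, the sample size, and passing a Python list to so.infer are assumptions for illustration.

import sutro as so
from pathlib import Path
from pydantic import BaseModel

# Same illustrative FAQ schema as in the earlier sketch
class QAPair(BaseModel):
    question: str
    answer: str

system_prompt = 'Generate a question/answer pair for the following chunk of documentation.'

# Assumed layout: pre-chunked documentation, one .txt file per chunk
all_chunks = [p.read_text() for p in sorted(Path('doc_chunks').glob('*.txt'))]

# Prototype: iterate on the prompt and schema against a small sample first
sample = all_chunks[:100]
draft = so.infer(sample, system_prompt, output_schema=QAPair)

# Scale: once the drafts look right, submit the full corpus as one batch job
faqs = so.infer(all_chunks, system_prompt, output_schema=QAPair)

Because the call is identical in both steps, the only change between prototyping and the full run is the size of the input, which makes the workflow straightforward to hand off to an orchestrator like Airflow or Dagster.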

Scale Effortlessly

Go from a single document to millions. Sutro handles the infrastructure, letting you generate FAQs for your entire knowledge base without performance bottlenecks or operational overhead.

Reduce Costs by 10x or More

By parallelizing your LLM calls through Sutro, you can get results faster and dramatically reduce the cost of generating FAQs at scale.

Shorten Development Cycles

Rapidly prototype your FAQ generation workflow on large batches. Get feedback in minutes before scaling up, so you can accelerate experiments and ship faster.

Document summarization

Condense large collections of documents into concise, readable summaries.

RAG data preparation

Easily convert large corpuses of text into vector representations for semantic search.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets.

Synthetic data generation

Generate high-quality, diverse, and representative synthetic data to improve model performance.

Structured extraction

Transform unstructured data into structured insights that drive business decisions.

Content personalization

Tailor your marketing and advertising efforts to thousands of individuals, personas, and demographics.

FAQ

What is Sutro?

How does Sutro reduce costs?

Do I need to manage my own infrastructure?

What kind of tasks can I perform with Sutro?

Does Sutro work with my existing tools?

What Will You Scale with Sutro?