Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
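
Below is a minimal sketch of how the question/answer generation shown above could be expressed with Sutro's Python SDK, reusing the so.infer call and Pydantic-schema pattern from the workflow example later on this page. The QAPair schema and the doc_chunks list are illustrative assumptions, not taken from the Sutro documentation, and it assumes so.infer accepts a list of strings.

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one generated question/answer pair
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative documentation chunks, drawn from the demo above
doc_chunks = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving...",
    "vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...",
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation'

qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)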

Structured Extraction

Turn millions of documents into analytics-ready data in hours, not days

Sutro transforms unstructured data into structured insights that drive business decisions. Run LLM batch jobs at a fraction of the cost and without the pain of managing infrastructure.

A simple, scalable workflow for data extraction

Sutro's Python SDK and simple workflow let you start small with your extraction tasks and scale to millions of requests effortlessly.

import sutro as so
from pydantic import BaseModel

# Structured output schema for each review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file of reviews to classify (the demo cycles through
# user_reviews-1.csv, user_reviews-2.csv, and user_reviews-3.csv)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Rapidly Prototype

Start small and iterate fast on your LLM batch workflows. Accelerate experiments by testing your extraction logic on Sutro before committing to large jobs.
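
As one hedged illustration, a prototype run might first classify a small sample before the full job, reusing the so.infer call and ReviewClassifier schema from the example above. It assumes so.infer accepts a plain list of strings, and the 'review' column name is an illustrative assumption.

import csv
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

# Read a small sample of reviews to iterate on the prompt and schema first
# (the 'review' column name is an illustrative assumption)
with open('user_reviews.csv', newline='') as f:
    sample = [row['review'] for row in csv.DictReader(f)][:100]

system_prompt = 'Classify the review as positive, neutral, or negative.'
sample_results = so.infer(sample, system_prompt, output_schema=ReviewClassifier)

# Once the outputs look right, point the same call at the full dataset
results = so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)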

Scale Effortlessly

Scale your extraction workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches.

Integrate Seamlessly

Connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
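
A minimal sketch of what that integration could look like in an Airflow DAG, assuming the Airflow 2.4+ TaskFlow API and the same so.infer call shown above; the DAG name, schedule, and result handling are illustrative assumptions rather than part of the Sutro documentation.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def review_classification_pipeline():

    @task
    def classify_reviews():
        import sutro as so
        from pydantic import BaseModel

        class ReviewClassifier(BaseModel):
            sentiment: str

        # Kick off the Sutro batch job from inside the Airflow task
        results = so.infer(
            'user_reviews.csv',
            'Classify the review as positive, neutral, or negative.',
            output_schema=ReviewClassifier,
        )
        # Persist results or hand them off to downstream tasks as needed

    classify_reviews()

review_classification_pipeline()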

Extract insights at massive scale

Confidently handle millions of requests and billions of tokens at a time. Crawl millions of web pages or process large corpuses of free-form text without the pain of managing infrastructure.

Reduce data processing costs by 10x or more

Get results faster and reduce costs significantly by parallelizing your LLM calls. Process your unstructured ETL workflows without exploding costs.

Get analytics-ready data faster

Shorten development cycles and get feedback from large batch jobs in minutes. Unlock valuable product insights from thousands of reviews while you brew your morning coffee.

Unstructured ETL

Turn messy, unstructured documents into clean, analytics-ready tables as part of your existing ETL pipelines, without standing up your own inference infrastructure.

Data Enrichment

Improve your messy product catalog data, enrich your CRM entries, or gather insights from your historical meeting notes without involving your machine learning engineer.

Classification

Automatically organize your data into meaningful categories without involving your ML engineer.

Unlock Product Insights

Easily sift through thousands of product reviews and unlock valuable product insights while brewing your morning coffee.

Embedding Generation

Easily convert large corpuses of free-form text into vector representations for semantic search and recommendations.

Synthetic Data Generation

Generate high-quality, diverse, and representative synthetic data to improve model or RAG retrieval performance, without the complexity.

FAQ

What can I do with Sutro?

How does Sutro help reduce costs?

How does Sutro integrate with my existing tools?

How does Sutro help with scaling?

How can I start prototyping with Sutro?

What Will You Scale with Sutro?