Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install vLLM's RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
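
As a rough sketch of how a job like this might be expressed with Sutro's Python SDK (mirroring the so.infer call shown in the workflow example further down this page; the QAPair schema and input file name are illustrative assumptions, not taken from Sutro's documentation):

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one generated question/answer pair.
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative input: a file of documentation chunks, one chunk per row.
doc_chunks = 'vllm_doc_chunks.csv'

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)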

Document OCR

Unlock Insights from Millions of Documents, Instantly

Turn vast quantities of unstructured documents—from patents and invoices to web pages and historical notes—into structured, analytics-ready data. Sutro makes large-scale document processing simple, fast, and affordable.

A Simplified Workflow for Document Processing

Sutro takes the pain out of testing and scaling LLM batch jobs, unblocking your most ambitious document digitization projects.

import sutro as so
from pydantic import BaseModel

# Define the structured output you want for each input row.
class ReviewClassifier(BaseModel):
    sentiment: str

# Point the job at your input data (here, the selected reviews file).
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Prototype

Start small and iterate fast on your document extraction workflows. Accelerate experiments by testing on a small batch of documents with Sutro before committing to large jobs.
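
A minimal sketch of that prototype-then-scale pattern, reusing the so.infer call from the workflow example above (the sample and full file names are illustrative assumptions):

import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Prototype: run a small sample file first and sanity-check the outputs.
sample_results = so.infer('user_reviews_sample.csv', system_prompt, output_schema=ReviewClassifier)

# Scale: once the prompt and schema look right, point the same call at the full dataset.
results = so.infer('user_reviews.csv', system_prompt, output_schema=ReviewClassifier)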

Scale

Scale your document processing workflows so your team can do more in less time. Process billions of tokens from your documents in hours, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing data workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster, and works with data from any object storage.
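
As one possible integration sketch, here is the same batch job wrapped as a Dagster asset (assuming Dagster's standard @asset decorator; the asset name, file, and schema are illustrative, and an Airflow task could wrap the same call):

import sutro as so
from dagster import asset
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

@asset
def classified_reviews():
    # One step in a larger orchestrated pipeline: run the Sutro batch job
    # and hand the structured results to downstream assets.
    return so.infer(
        'user_reviews.csv',
        'Classify the review as positive, neutral, or negative.',
        output_schema=ReviewClassifier,
    )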

Scale Effortlessly

Confidently process millions of documents and billions of tokens at a time. Go from a single page to your entire archive without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get structured data faster and significantly reduce costs. Sutro parallelizes LLM calls to run batch jobs at a fraction of the cost of other methods.

Go from Documents to Data in Hours, Not Days

Shorten development cycles by getting feedback from large document batches in minutes. Run massive OCR and extraction jobs and get complete results in hours.

Structured Extraction

Invoice data extraction

Automate the extraction of critical information from invoices and financial documents (a schema sketch for this follows the list below).

Document summarization

Condense large documents into concise summaries to speed up analysis.

RAG data preparation

Easily convert large corpora of free-form text into vector representations for semantic search.

Unstructured ETL

Convert massive amounts of free-form text into analytics-ready datasets without infrastructure pains.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets.
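
For the invoice extraction use case above, a hedged sketch of what the schema and call might look like, following the same so.infer pattern used earlier on this page (the field names and input file are illustrative assumptions):

import sutro as so
from pydantic import BaseModel

# Hypothetical fields to pull from each invoice.
class InvoiceFields(BaseModel):
    vendor: str
    invoice_number: str
    invoice_date: str
    total_amount: float

system_prompt = 'Extract the vendor, invoice number, invoice date, and total amount from each invoice.'

invoices = so.infer('invoices.csv', system_prompt, output_schema=InvoiceFields)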

FAQ

What can I do with Sutro?

How does Sutro help reduce costs?

Is Sutro difficult to integrate into my current setup?

How fast is Sutro for large jobs?

What kind of scale can Sutro handle?

What Will You Scale with Sutro?