Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
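
As a rough sketch of how a job like this might be expressed with Sutro's Python SDK, assuming the so.infer call and Pydantic output schema pattern shown in the example further down this page (the QAPair schema and doc_chunks list below are illustrative, not part of the vLLM documentation):

import sutro as so
from pydantic import BaseModel

# Illustrative output schema: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# Hypothetical list of documentation chunks, one element per chunk
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving. ...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. ...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)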

Invoice data extraction

Transform Millions of Invoices into Structured Data in Hours, Not Days

Run LLM batch jobs to transform unstructured invoice data into structured insights that drive business decisions, at a fraction of the cost and without managing complex infrastructure.
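
A minimal sketch of what an invoice extraction job could look like with the same Python SDK, assuming the so.infer pattern shown in the example below (the InvoiceFields schema and the invoices.csv input are illustrative):

import sutro as so
from pydantic import BaseModel

# Illustrative schema for the structured fields to pull out of each invoice
class InvoiceFields(BaseModel):
    vendor: str
    invoice_number: str
    invoice_date: str
    total_amount: float
    currency: str

# Hypothetical input file containing one OCR'd invoice text per row
invoices = 'invoices.csv'

system_prompt = 'Extract the vendor, invoice number, invoice date, total amount, and currency from this invoice.'

results = so.infer(invoices, system_prompt, output_schema=InvoiceFields)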


From Invoice Scans to Structured Insights, Simplified

Sutro takes the pain out of testing and scaling LLM batch jobs for your most ambitious data extraction projects.

import sutro as so
from pydantic import BaseModel

# Output schema: each review is classified into a single sentiment label
class ReviewClassifier(BaseModel):
    sentiment: str

# Input reviews file (the demo shows User_reviews.csv selected from a file picker)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Rapidly Prototype

Shorten development cycles by getting feedback from large batch jobs in minutes. Start small and iterate fast on your invoice extraction workflows before committing to large jobs.

Scale to Production

Scale your LLM workflows to process billions of tokens in hours, not days. Do more in less time with no infrastructure headaches or exploding costs.

Integrate with Your Stack

Seamlessly connect Sutro to your existing workflows. Sutro's Python SDK is compatible with popular data orchestration tools, object storage, and open data formats.
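
As one hedged example of what that integration might look like, the sketch below reads inputs from Parquet with polars, runs a batch job, and writes results back out. Polars and Parquet stand in here for "open data formats", and both passing a plain Python list to so.infer and the shape of the results it returns are assumptions rather than documented behavior:

import polars as pl
import sutro as so
from pydantic import BaseModel

# Illustrative output schema, matching the review example above
class ReviewClassifier(BaseModel):
    sentiment: str

# Read inputs from an open columnar format produced by an upstream pipeline step
reviews = pl.read_parquet('user_reviews.parquet')['review_text'].to_list()

system_prompt = 'Classify the review as positive, neutral, or negative.'
results = so.infer(reviews, system_prompt, output_schema=ReviewClassifier)

# Assumption: results come back as a list of ReviewClassifier instances;
# convert them to plain rows and write an analytics-ready Parquet file
pl.DataFrame([r.model_dump() for r in results]).write_parquet('review_sentiment.parquet')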

Scale Effortlessly

Confidently handle millions of invoice extraction requests, processing billions of tokens at a time without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get results faster and reduce costs significantly by parallelizing your LLM calls through Sutro's purpose-built platform for batch jobs.

Simplify Unstructured ETL

Convert massive amounts of free-form text from invoices into analytics-ready datasets without the pain of managing your own infrastructure.

Contact info extraction

Extract names, email addresses, phone numbers, and mailing addresses from free-form text into structured contact records.

Job description parsing

Transform job descriptions into structured data fields for analysis and candidate matching.

Structured Extraction

Turn any unstructured data into structured, analytics-ready datasets that drive business decisions.

Document summarization

Condense large documents into concise summaries to quickly find key information.

RAG data preparation

Easily convert large corpuses of free-form text into vector representations for semantic search.

Unstructured ETL

Convert massive amounts of free-form text into analytics-ready datasets without managing your own infrastructure.


FAQ

What kind of tasks can I run with Sutro?

How does Sutro reduce costs?

Can Sutro handle large-scale jobs?

How do I integrate Sutro into my workflow?

What is structured extraction?

What Will You Scale with Sutro?