Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…

Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Job description parsing

Turn millions of job descriptions into structured data, instantly

Stop wasting time on manual data entry or building complex infrastructure. Sutro transforms unstructured job descriptions from any source into analytics-ready datasets, letting you run LLM batch jobs in hours, not days, at a fraction of the cost.


From Raw Text to Recruiter-Ready Data, Simplified

Sutro takes the pain out of testing and scaling LLM batch jobs, unlocking your most ambitious data projects.

import sutro as so
from pydantic import BaseModel

# Define the structured output you want for each review
class ReviewClassifier(BaseModel):
    sentiment: str

# Reviews to classify (the demo points this at a CSV of user reviews)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Run the batch job and get structured results back
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype Your Parser

Start small and iterate fast. Define your desired structured output and test your extraction logic on a small batch of job descriptions to accelerate experiments before committing to large jobs.
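
To make the prototyping step concrete, here is a minimal sketch using the same so.infer call as the review example above. The JobPosting fields, the sample postings, and the prompt are illustrative assumptions rather than a documented Sutro schema, and it assumes the input can be passed as a small list of strings.

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for illustration; define whatever fields you need
class JobPosting(BaseModel):
    title: str
    seniority: str
    location: str
    remote: bool

# A handful of postings to validate the schema and prompt before scaling up
sample_postings = [
    "Senior Backend Engineer - Remote (US). 7+ years of experience with Python...",
    "Staff Data Scientist, New York (hybrid). Experience with large-scale ML systems...",
]

system_prompt = "Extract the job title, seniority level, location, and whether the role is remote."

results = so.infer(sample_postings, system_prompt, output_schema=JobPosting)

Once the outputs look right on the sample, the same call can be pointed at the full dataset.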

Scale Your Extraction

Scale your LLM workflows to process millions of job descriptions and billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate Your Data

Seamlessly connect Sutro to your existing recruiting and data workflows. Sutro's Python SDK is compatible with popular data orchestration tools like Airflow and Dagster.
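
As a rough sketch of what that integration could look like, the snippet below wraps a Sutro batch call in a daily Airflow DAG using the Airflow 2.x TaskFlow API. The DAG id, schedule, input file, prompt, and schema are illustrative assumptions, not names mandated by Sutro or Airflow.

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def parse_job_descriptions():
    @task
    def run_batch_extraction():
        import sutro as so
        from pydantic import BaseModel

        # Hypothetical schema and input file for illustration
        class JobPosting(BaseModel):
            title: str
            location: str

        results = so.infer(
            "job_descriptions.csv",  # replace with your own source
            "Extract the job title and location from each posting.",
            output_schema=JobPosting,
        )
        # Persist results wherever your downstream tables or warehouse expect them

    run_batch_extraction()

parse_job_descriptions()

A similar pattern works with Dagster by wrapping the same call in an asset or op.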

Scale your sourcing effortlessly

Confidently handle millions of job descriptions at a time. Process billions of tokens without the pain of managing infrastructure, so your team can do more in less time.

Reduce costs by 10x or more

Get results faster and significantly lower your expenses. Sutro parallelizes your LLM calls to process huge volumes of text far more efficiently than traditional methods.

Shorten your development cycles

Rapidly prototype your data extraction logic. Get feedback from large batch jobs in minutes, allowing you to test and iterate on your parsing workflows before committing to a full-scale run.

Resume screening

Screen thousands of resumes against role requirements in a single batch run.

Structured Extraction

Transform any unstructured data, from documents to web pages, into structured insights.

Contact info extraction

Pull contact details from emails, documents, and web pages in bulk.

Website data extraction

Crawl millions of web pages and extract analytics-ready datasets from complex sites.

Invoice data extraction

Automate accounts payable by extracting key information from invoices at scale.

Performance review summarization

Distill thousands of performance reviews into key themes and actionable insights.

FAQ

What is Sutro?

How does Sutro reduce costs?

What kind of tasks can I run on Sutro?

How do I integrate Sutro with my current tools?

Can I test my job before running it at scale?

What Will You Scale with Sutro?