Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently and streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:
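(The command itself is cut off in this chunk; per the vLLM docs it is presumably:)

pip install vllm[runai]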

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Contact info extraction

Extract Millions of Contacts, Faster and for Less

Transform unstructured data like web pages, documents, and CRM entries into structured contact information. Run LLM batch jobs in hours, not days, at a fraction of the cost.

From Unstructured Text to Actionable Contacts

Sutro simplifies the entire process of extracting contact information at scale. Start small, test your extraction logic, and scale to millions of records with ease.

import sutro as so
import pandas as pd
from pydantic import BaseModel

# Output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Load the reviews to classify (the demo cycles through
# User_reviews.csv, User_reviews-1.csv, User_reviews-2.csv, ...)
user_reviews = pd.read_csv('User_reviews.csv')

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m | Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
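The same pattern extends to contact extraction. Here is a minimal sketch, assuming a hypothetical Contact schema and an assumed crawled_pages.csv of unstructured text; only the so.infer call itself comes from the demo above, and every field and file name here is illustrative:

import sutro as so
import pandas as pd
from pydantic import BaseModel

# Hypothetical output schema; define whichever fields your use case needs
class Contact(BaseModel):
    name: str
    email: str
    phone: str
    company: str

# Assumed input: a CSV of scraped web pages, documents, or CRM entries
pages = pd.read_csv('crawled_pages.csv')

system_prompt = 'Extract the contact name, email, phone, and company from the text.'

contacts = so.infer(pages, system_prompt, output_schema=Contact)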

Prototype and Iterate Fast

Start small and iterate fast on your contact extraction workflows. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale to Millions of Records

Scale your extraction workflows to process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.

Integrate with Your Existing Tools

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
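As a rough illustration, a daily Airflow DAG could wrap a Sutro batch job as an ordinary task. The DAG boilerplate below is standard Airflow; the Sutro call reuses the so.infer pattern shown earlier, and the schema and file names are again illustrative:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def extract_contacts():
    @task
    def run_batch_job():
        import sutro as so
        import pandas as pd
        from pydantic import BaseModel

        # Hypothetical output schema for this sketch
        class Contact(BaseModel):
            name: str
            email: str

        pages = pd.read_csv('crawled_pages.csv')  # assumed input file
        system_prompt = 'Extract the contact name and email from the text.'
        results = so.infer(pages, system_prompt, output_schema=Contact)
        # Persist results wherever your pipeline expects them
        # (warehouse, object storage, downstream task, ...)

    run_batch_job()

extract_contacts()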

Scale Effortlessly

Confidently handle millions of requests to extract contact info from documents or web pages. Process billions of tokens at a time without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get results faster and significantly reduce costs. Sutro parallelizes your LLM calls to transform massive amounts of free-form text into analytics-ready datasets.

Get Results in Hours, Not Days

Sutro takes the pain away from testing and scaling LLM batch jobs. Shorten development cycles and process millions of records in hours, not days.

Structured Extraction

Transform unstructured data like web pages, documents, and CRM entries into structured, schema-validated records.

Website Data Extraction

Crawl millions of web pages, and extract analytics-ready datasets for your company or your customers.

Invoice Data Extraction

Automate the extraction of key information from invoices to streamline your accounting and payment processes.

Resume Screening

Efficiently parse thousands of resumes to identify qualified candidates by extracting skills, experience, and education.

Unstructured ETL

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

Data Enrichment

Improve your messy product catalog data or enrich your CRM entries without involving your machine learning engineer.


FAQ

What is Sutro?

How does Sutro save costs?

What kind of data can I process?

Does Sutro integrate with my current workflow?

What use cases is Sutro good for?

What Will You Scale with Sutro?