Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating


Using the OpenAI Server

Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer Documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Data mastering

Create a single source of truth from millions of records in hours

Unify disparate, unstructured, and messy data sources into a clean, analytics-ready dataset. Sutro takes the pain away from testing and scaling LLM batch jobs, helping you enrich data and transform free-form text into structured insights without managing infrastructure.

From Messy Data to Mastered, Simplified

Sutro provides a simple, Python-native workflow to master your data at any scale.

import sutro as so
from pydantic import BaseModel

# Schema that each classified review must match.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input file of reviews and the instruction applied to every row.
user_reviews = 'User_reviews.csv'
system_prompt = 'Classify the review as positive, neutral, or negative.'

# Submit the batch job; results follow the ReviewClassifier schema.
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your data mastering workflows. Accelerate experiments by testing on a sample of your data before committing to a large job.
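
One way to prototype on a sample is to draw a small, reproducible subset of rows before submitting the full dataset. The sketch below uses only the Python standard library; the helper name `sample_rows`, the seed, and the in-memory stand-in data are assumptions for illustration and are not part of the Sutro SDK.

```python
import csv
import io
import random

def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random, reproducibly."""
    rows = list(rows)
    if k >= len(rows):
        return rows
    rng = random.Random(seed)  # fixed seed so the pilot set is repeatable
    return rng.sample(rows, k)

# Stand-in for a reviews file; a real job would read the CSV from disk.
csv_text = "review\n" + "\n".join(f"review text {i}" for i in range(1000))
all_rows = list(csv.DictReader(io.StringIO(csv_text)))

pilot = sample_rows(all_rows, k=50, seed=42)
print(len(all_rows), len(pilot))
```

Keeping the seed fixed means the same pilot rows come back on every run, so prompt changes can be compared on identical inputs.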

Scale

Scale your data mastering workflows so your team can do more in less time. Process billions of tokens in hours with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.

Scale effortlessly

Confidently handle millions of records and billions of tokens at a time. Process your entire dataset for mastering without the pain of managing infrastructure or worrying about scale.

Reduce costs by 10x or more

Get results faster and reduce costs significantly by parallelizing your LLM calls. Avoid expensive engineering resources and infrastructure overhead for your data mastering projects.
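
As a generic illustration of why parallel calls shorten wall-clock time, the sketch below fans a list of reviews out over a thread pool. It is not Sutro's implementation: `classify` is a hypothetical keyword-based stand-in for a single network-bound LLM call, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def classify(review: str) -> str:
    """Hypothetical stand-in for one LLM call; a real call would be network-bound."""
    text = review.lower()
    if "great" in text or "love" in text:
        return "positive"
    if "bad" in text or "broken" in text:
        return "negative"
    return "neutral"

def classify_all(reviews, max_workers=8):
    # pool.map preserves input order, so labels line up with reviews.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, reviews))

reviews = ["Love it, works great", "Arrived broken", "It is a keyboard"]
labels = classify_all(reviews)
print(labels)
```

Threads suit this pattern because each worker spends most of its time waiting on a response rather than computing.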

Go from raw data to mastered in hours

Shorten development cycles by running large-scale data mastering jobs in hours, not days. Get feedback and analytics-ready data faster than ever before.

Record deduplication

Data normalization

Standardize data formats across different sources to create consistent, unified datasets.
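
To make the target of normalization concrete, here is a minimal rule-based sketch that maps several date spellings onto ISO 8601. The format list and the helper name `normalize_date` are assumptions for illustration; values that no rule matches come back as `None`, and those are the kind of messy cases one might hand to an LLM batch job instead.

```python
from datetime import datetime

# Formats assumed for illustration; a real dataset would need its own list.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")

def normalize_date(raw: str):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    cleaned = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None

raw_dates = ["2024-03-07", "03/07/2024", "7 Mar 2024", "March 7, 2024", "last Tuesday"]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
```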

Structured extraction

Transform unstructured data from documents, web pages, or reviews into structured, analytics-ready formats.
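
Structured output is only analytics-ready if every record actually matches the expected shape. The sketch below checks one JSON record against a small schema using the standard library; the `ReviewRecord` fields, the allowed sentiment labels, and the sample JSON strings are assumptions for illustration, standing in for whatever schema a real job defines.

```python
import json
from dataclasses import dataclass

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}

@dataclass
class ReviewRecord:
    product: str
    sentiment: str

def parse_record(raw_json: str) -> ReviewRecord:
    """Parse one model output and reject anything outside the expected shape."""
    data = json.loads(raw_json)
    record = ReviewRecord(product=str(data["product"]), sentiment=str(data["sentiment"]))
    if record.sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {record.sentiment!r}")
    return record

good = parse_record('{"product": "keyboard", "sentiment": "positive"}')

try:
    parse_record('{"product": "mouse", "sentiment": "meh"}')
    rejected = False
except ValueError:
    rejected = True  # out-of-vocabulary label is caught before it reaches analytics

print(good, rejected)
```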

Entity extraction

Identify and categorize key entities within large volumes of text for better organization and insight.

Metadata generation

Automatically create and apply descriptive metadata to your data assets for improved search and governance.

Error detection

Scan and identify inconsistencies or errors within your datasets at scale to improve data quality.

FAQ

What can I do with Sutro?

How does Sutro reduce costs?

How fast is Sutro for large jobs?

Do I need to manage my own infrastructure?

What tools does Sutro integrate with?

What Will You Scale with Sutro?