Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating...


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
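
A minimal sketch of how a job like this could be expressed with Sutro's Python SDK follows; it reuses the so.infer and output_schema pattern from the tagging example later on this page, and the QAPair fields and sample documentation chunks are illustrative assumptions rather than a documented recipe.

# Hypothetical sketch: generating Q/A pairs from documentation chunks in batch.
import sutro as so
from pydantic import BaseModel

class QAPair(BaseModel):
    question: str
    answer: str

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# A couple of documentation chunks, echoing the inputs shown above.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving. ...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. ...',
]

results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)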

Document tagging

Tag millions of documents in hours, not days

Automatically organize your data into meaningful categories without involving your ML engineer. Sutro lets you run LLM batch jobs at a fraction of the cost, turning unstructured data into structured insights.

From messy documents to structured tags, simplified

Sutro's Python SDK simplifies the entire process of tagging large document sets. Connect to your data and let Sutro handle the complexity of scaling your LLM workflows.

import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

# The review files to classify.
user_reviews = [
    'User_reviews.csv',
    'User_reviews-1.csv',
    'User_reviews-2.csv',
    'User_reviews-3.csv',
]

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Prototype

Start small and iterate fast on your document tagging workflows. Accelerate experiments by testing on Sutro before committing to large jobs.

Scale

Scale your tagging workflows so your team can do more in less time. Process billions of tokens in hours, not days, with no infrastructure headaches.

Integrate

Seamlessly connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
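
As a sketch of what that can look like in practice, the snippet below wraps a Sutro batch job in an Airflow task using the TaskFlow API; the DAG name, schedule, schema, and input file are illustrative assumptions rather than a documented integration.

# Hypothetical sketch: running a Sutro batch job from an Airflow DAG.
from datetime import datetime

import sutro as so
from airflow.decorators import dag, task
from pydantic import BaseModel

class DocumentTags(BaseModel):
    category: str

@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def tag_documents():
    @task
    def run_batch_tagging():
        # Submit the batch job; handling of the returned results is omitted here.
        so.infer(
            ['documents.csv'],
            'Assign a category to each document.',
            output_schema=DocumentTags,
        )

    run_batch_tagging()

tag_documents()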

Scale your organization effortlessly

Confidently handle millions of documents and billions of tokens at a time. Sutro removes the pain of managing infrastructure so you can focus on results.

Reduce costs by 10x or more

Get results faster and reduce costs by parallelizing your LLM calls through Sutro. Process massive datasets without exploding costs.

Shorten development cycles

Get feedback from large batch jobs in as little as minutes before scaling up. Rapidly prototype and iterate on your document tagging workflows.
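
One way to get that early feedback is sketched below: run the same prompt and schema on a small sample before submitting the full dataset. The inline review strings and the assumption that so.infer accepts a smaller input in exactly the same way as the full one are illustrative, not a documented workflow.

# Hypothetical sketch: prototype on a small slice before scaling up.
import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

system_prompt = 'Classify the review as positive, neutral, or negative.'

reviews = [
    'Great product, arrived early and works as described.',
    'Stopped working after a week.',
    "It's fine, nothing special.",
]

# Prototype: sanity-check the prompt and schema on a handful of rows.
sample_results = so.infer(reviews[:2], system_prompt, output_schema=ReviewClassifier)

# Scale: once the outputs look right, submit the full dataset unchanged.
full_results = so.infer(reviews, system_prompt, output_schema=ReviewClassifier)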

Text classification

Classify large volumes of free-form text into consistent, meaningful categories at scale.

Structured Extraction

Transform unstructured data into structured insights that drive business decisions.

Metadata generation

Enrich your data with meaningful labels to improve model training and data preparation.

Entity extraction

Extract specific information from large volumes of free-form text to enrich datasets (a short sketch follows this list).

Document summarization

Condense large documents into concise summaries to quickly gather key insights.

RAG data preparation

Easily convert large corpora of free-form text into vector representations for semantic search.
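
To make the structured-extraction and entity-extraction use cases above concrete, here is a minimal sketch that reuses the same so.infer and output_schema pattern from the SDK example; the ContractFacts fields and the sample contract text are illustrative assumptions.

# Hypothetical sketch: pulling structured entities out of free-form contract text.
import sutro as so
from pydantic import BaseModel

class ContractFacts(BaseModel):
    counterparty: str
    effective_date: str
    renewal_term_months: int

system_prompt = 'Extract the counterparty, effective date, and renewal term (in months) from the contract text.'

contracts = [
    'This agreement between Acme Co. and Example Corp is effective 2024-03-01 and renews every 12 months unless terminated...',
]

results = so.infer(contracts, system_prompt, output_schema=ContractFacts)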

FAQ

What is Sutro?

How does Sutro help reduce costs?

What kind of tasks can I perform with Sutro?

Does Sutro integrate with my existing tools?

How does Sutro handle large-scale jobs?

What Will You Scale with Sutro?