Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating ...


Using the OpenAI Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI-compatible server to replace calls...

+128 more…
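A minimal sketch of what a job like the one above could look like with Sutro's Python SDK, modeled on the so.infer call shown further down this page. The DocQA schema, the doc_chunks list, and passing a plain Python list as input are illustrative assumptions, not Sutro's documented API.

import sutro as so
from pydantic import BaseModel

# Illustrative output schema for one generated question/answer pair (assumed, not from Sutro docs)
class DocQA(BaseModel):
    question: str
    answer: str

# Chunks of vLLM documentation to process (placeholder data)
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation.'

# One batch job over every chunk; each output conforms to the DocQA schema
results = so.infer(doc_chunks, system_prompt, output_schema=DocQA)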

Product insight mining

Unlock Product Insights From Millions of Reviews in Hours

Easily sift through thousands of product reviews to unlock valuable insights while brewing your morning coffee. Sutro transforms massive amounts of free-form text into analytics-ready datasets without the pain of managing infrastructure or exploding costs.

How to Mine Product Insights with Sutro

Sutro simplifies your entire workflow, from prototyping your analysis on a small sample to processing your entire dataset. Seamlessly connect to your existing data stack and get results fast.

import sutro as so
from pydantic import BaseModel

# Structured output schema for each review
class ReviewClassifier(BaseModel):
    sentiment: str

# Path to the reviews to classify (the demo cycles through user_reviews.csv and similar files)
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

Rapidly Prototype

Start small and iterate fast on your analysis workflow. Accelerate experiments by getting feedback from batch jobs on sample data in minutes before committing to a large job.
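A minimal sketch of that prototype-then-scale loop, reusing the so.infer pattern from the example above; the placeholder data, the sample size, and passing a plain Python list are assumptions for illustration.

import sutro as so
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Placeholder reviews; in practice these would come from your own files or warehouse
all_reviews = ['Great battery life, would buy again.', 'Arrived broken and support never replied.']

# Prototype: get feedback on a small slice before committing to the full job
sample_results = so.infer(all_reviews[:500], system_prompt, output_schema=ReviewClassifier)

# Scale: once the prompt and schema look right, run the entire dataset
results = so.infer(all_reviews, system_prompt, output_schema=ReviewClassifier)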

Scale Effortlessly

Scale your analysis so your team can do more in less time. Process billions of tokens from millions of reviews in hours, not days, with no infrastructure headaches or exploding costs.

Integrate Seamlessly

Connect Sutro to your existing LLM workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster, and works with your existing notebooks and object storage.
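A minimal sketch of what wiring Sutro into an orchestrator could look like, shown here as a Dagster asset; the asset name, the data path, and the so.infer usage mirror the example above and are assumptions rather than a documented integration.

import sutro as so
from dagster import asset
from pydantic import BaseModel

class ReviewClassifier(BaseModel):
    sentiment: str

@asset
def classified_reviews():
    # Reviews staged by an upstream step of your pipeline (placeholder path)
    user_reviews = 'user_reviews.csv'
    system_prompt = 'Classify the review as positive, neutral, or negative.'
    # The batch job runs on Sutro; the orchestrator just tracks this step like any other asset
    return so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)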

Analyze at Unprecedented Scale

Confidently handle millions of requests and billions of tokens at a time. Go from a small sample of reviews to your entire history without the pain of managing infrastructure.

Reduce Costs by 10x or More

Get results faster and reduce costs by parallelizing your LLM calls through Sutro. Process your entire corpus of reviews for a fraction of the cost.

From Idea to Insights, Simplified

Run LLM batch jobs in hours, not days. Sutro takes the pain away from testing and scaling your analysis, unblocking your most ambitious AI projects.

Sentiment Analysis

Classify thousands of reviews as positive, neutral, or negative to understand how your customers feel at scale.

Structured Extraction

Transform unstructured data from web pages, documents, or reviews into structured insights that drive business decisions.

Unstructured ETL

Convert massive amounts of free-form text into analytics-ready datasets without the pain of managing your own infrastructure.

Document Summarization

Easily sift through thousands of documents and unlock valuable insights for your team to consume.

Personalize Content

Tailor your marketing and advertising efforts to thousands of individuals, personas, and demographics to dramatically increase response rates.

Embedding Generation

Easily convert large corpuses of free-form text into vector representations for semantic search and recommendations.
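As an illustration of the structured extraction and unstructured ETL cases above, the same so.infer call can take a richer output schema so each review becomes one analytics-ready row; the field names here are hypothetical, not a documented schema.

import sutro as so
from pydantic import BaseModel

# Hypothetical multi-field schema for analytics-ready rows
class ReviewInsights(BaseModel):
    sentiment: str
    product_mentioned: str
    summary: str

user_reviews = 'user_reviews.csv'  # placeholder path, as in the example above
system_prompt = 'Extract the sentiment, the product mentioned, and a one-sentence summary from the review.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewInsights)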

FAQ

What is Sutro?

What kind of tasks can I perform with Sutro?

How does Sutro help reduce costs?

Can I integrate Sutro into my existing data pipelines?

How do I get started with Sutro?

What Will You Scale with Sutro?