Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…

Data scraping

Crawl and Extract from Millions of Web Pages in Hours

Transform unstructured websites into analytics-ready structured datasets for your company or your customers. Run LLM batch jobs in hours, not days, at a fraction of the cost.

Purpose-Built for Scalable LLM Workflows

Ship faster results without complex infrastructure. Sutro's purpose-built tools help you scale up any LLM workflow, including data scraping and extraction.

import sutro as so
from pydantic import BaseModel

# Structured output schema for each review
class ReviewClassifier(BaseModel):
    sentiment: str

# Point the job at the file of reviews to classify
user_reviews = 'user_reviews.csv'


system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k


Prototype

Start small and iterate fast on your data extraction workflows. Accelerate experiments by testing your extraction schema on a sample of pages before committing to a large job.

Scale

Scale your scraping jobs so your team can do more in less time. Process billions of tokens from millions of pages in hours, not days, with no infrastructure headaches.

Integrate

Seamlessly connect Sutro to your existing data pipelines. Our Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
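To make the Prototype, Scale, and Integrate steps concrete, here is a minimal sketch of running the same so.infer call from the example above inside an Airflow DAG. Only so.infer and its arguments come from that example; the PageSummary schema, file names, and DAG wiring are illustrative assumptions, not part of Sutro's documented API.

import sutro as so
from datetime import datetime
from airflow.decorators import dag, task
from pydantic import BaseModel

# Illustrative output schema; swap in whatever fields your extraction needs
class PageSummary(BaseModel):
    title: str
    summary: str

SYSTEM_PROMPT = 'Extract the page title and a one-sentence summary.'

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def scrape_and_extract():
    @task
    def prototype():
        # Iterate on the schema against a small sample of pages first
        return so.infer('pages_sample.csv', SYSTEM_PROMPT, output_schema=PageSummary)

    @task
    def full_run():
        # Same call, pointed at the full crawl once the schema looks right
        return so.infer('pages_full.csv', SYSTEM_PROMPT, output_schema=PageSummary)

    prototype() >> full_run()

scrape_and_extract()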

Scale Effortlessly

Confidently handle millions of requests and billions of tokens at a time without the pain of managing infrastructure. Scale your web crawling from a handful of pages to the entire web.

Reduce Costs by 10x or More

Get results faster and reduce costs by parallelizing your LLM calls through Sutro. Process massive amounts of web data without exploding costs.

From Idea to Insights, Simplified

Take the pain away from testing and scaling LLM batch jobs. Unblock your most ambitious data projects by getting from a scraping concept to a large-scale dataset, faster.

Structured Extraction

Define an output schema and pull clean, structured fields from any document or web page.

Unstructured ETL

Convert your massive amounts of free-form text into analytics-ready datasets without the pains of managing your own infrastructure.

Website data extraction

Extract specific information from websites to power your applications and analytics.

Contact info extraction

Automatically pull names, emails, and phone numbers from documents and web pages.

RAG data preparation

Easily convert large corpuses of text into vector representations for semantic search and retrieval.

Product insight mining

Easily sift through thousands of product reviews and unlock valuable product insights.
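As one concrete instance of the use cases above, a contact info extraction job could reuse the same so.infer pattern shown earlier on this page. The ContactInfo schema, prompt, and input file name below are illustrative assumptions rather than documented Sutro examples.

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for the contact info extraction use case
class ContactInfo(BaseModel):
    name: str
    email: str
    phone: str

system_prompt = 'Extract the name, email address, and phone number from the page.'

# 'crawled_pages.csv' is an illustrative input; the call mirrors the example above
contacts = so.infer('crawled_pages.csv', system_prompt, output_schema=ContactInfo)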

FAQ

What kind of tasks is Sutro good for?

How does Sutro save money?

How does Sutro handle large jobs?

Does Sutro work with my existing tools?

How do I interact with Sutro?

What Will You Scale with Sutro?