Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
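
The demo above maps each documentation chunk to a generated question/answer pair. A minimal sketch of how that workflow might look with Sutro's Python SDK, using the so.infer call shown later on this page; the QAPair schema and the doc_chunks.csv input file are illustrative assumptions, not part of the page:

import sutro as so
from pydantic import BaseModel

# Hypothetical structured output: one question/answer pair per documentation chunk
class QAPair(BaseModel):
    question: str
    answer: str

# doc_chunks.csv is a placeholder file of vLLM documentation chunks
qa_pairs = so.infer(
    'doc_chunks.csv',
    'Generate a question/answer pair for the following chunk of vLLM documentation.',
    output_schema=QAPair,
)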

Fraud detection

Detect Fraud Across Millions of Transactions, Faster and For Less

Run LLM batch jobs to analyze millions of transactions and user activities in hours, not days, at a fraction of the cost. Sutro takes the pain away from testing and scaling LLM batch jobs to unblock your most ambitious AI projects.
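
As a rough sketch, the same batch pattern shown elsewhere on this page could be pointed at transaction data. The TransactionRisk schema and transactions.csv file below are illustrative assumptions; only so.infer itself appears in the page's own example:

import sutro as so
from pydantic import BaseModel

# Hypothetical structured verdict for each transaction
class TransactionRisk(BaseModel):
    is_suspicious: bool
    reason: str

# transactions.csv is a placeholder export of transaction records
results = so.infer(
    'transactions.csv',
    'Flag the transaction as suspicious or not, and briefly explain why.',
    output_schema=TransactionRisk,
)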

From Raw Data to Actionable Insights, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs for fraud detection, letting you focus on protecting your business.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review
class ReviewClassifier(BaseModel):
    sentiment: str

# Path to the input dataset of user reviews
user_reviews = 'User_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

# Run the batch job: one classification per review, validated against the schema
results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
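
Once the job finishes, the structured outputs can be tallied or joined back to the source data. A small sketch, assuming results can be iterated as ReviewClassifier-shaped records (the exact return type is not shown on this page):

from collections import Counter

# Count how many reviews fall into each sentiment bucket (assumed result shape)
sentiment_counts = Counter(r.sentiment for r in results)
print(sentiment_counts.most_common())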

Prototype

Start small and iterate fast on your fraud detection workflows. Accelerate experiments by testing on Sutro before committing to large jobs analyzing millions of records.

Scale

Scale your fraud detection workflows so your team can do more in less time. Process billions of tokens from transaction logs or user reports in hours, not days, with no infrastructure headaches or exploding costs.

Integrate

Seamlessly connect Sutro to your existing fraud detection and data workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
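
For example, a recurring batch job could be wrapped in an Airflow DAG. This is a minimal sketch using Airflow's TaskFlow API; the DAG name, schedule, input file, and prompt are illustrative assumptions:

from datetime import datetime

from airflow.decorators import dag, task
import sutro as so

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_fraud_scan():
    @task
    def classify_transactions():
        # Placeholder input and prompt; in practice, write the results to your warehouse
        return so.infer(
            'transactions.csv',
            'Flag the transaction as suspicious or not, and briefly explain why.',
        )

    classify_transactions()

nightly_fraud_scan()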

Scale your detection effortlessly

Confidently handle millions of requests, and billions of tokens at a time without the pain of managing infrastructure. Process entire transaction histories or user logs in a single batch job.

Reduce fraud detection costs

Get results faster and reduce costs by 10x or more by parallelizing your LLM calls through Sutro for any fraud analysis workflow.

Identify threats faster

Shorten development cycles by getting feedback from large batch jobs in minutes before scaling up. Run LLM batch jobs in hours, not days.

Enrich Data

Enrich your existing datasets with LLM-generated labels, categories, and attributes at scale.

Structure Web Pages

Crawl millions of web pages, and extract analytics-ready datasets for your company or your customers.

Unlock Product Insights

Easily sift through thousands of product reviews and unlock valuable product insights while brewing your morning coffee.

Personalize Content

Tailor your marketing and advertising efforts to thousands, or millions of individuals, personas, and demographics to dramatically increase response rates and ad conversions.

LLM performance evaluation

Benchmark your LLM outputs to continuously improve workflows, agents and assistants, or easily evaluate custom models against a new use-case.

Synthetic data generation

Generate high-quality, diverse, and representative synthetic data to improve model or RAG retrieval performance, without the complexity.
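
As one concrete example of the synthetic data use case above, a batch job could expand each document into several paraphrased user queries for testing RAG retrieval. The SyntheticQueries schema and documents.csv input are illustrative assumptions:

import sutro as so
from pydantic import BaseModel

# Hypothetical output: several query phrasings per source document
class SyntheticQueries(BaseModel):
    queries: list[str]

# documents.csv is a placeholder file of source documents
synthetic = so.infer(
    'documents.csv',
    'Write five diverse user questions that this document should be able to answer.',
    output_schema=SyntheticQueries,
)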

FAQ

What is Sutro?

How does Sutro help reduce costs?

What kind of tasks can I perform with Sutro?

How do I integrate Sutro into my existing workflow?

Can I test my workflows before running a large job?

What Will You Scale with Sutro?