Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

System prompt: Generate a question/answer pair for the following chunk of vLLM documentation

Inputs


Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating...


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:
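The excerpt cuts off before the install command; in vLLM's documentation this optional dependency is installed as an extra, roughly as follows (the exact extra name may differ between vLLM versions):

pip install vllm[runai]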

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
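Expressed with the Sutro SDK, a demo like this is a few lines of Python. A minimal sketch, mirroring the so.infer call shown later on this page; the QAPair schema and the in-memory doc_chunks list are illustrative assumptions, not part of a documented Sutro API:

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for the question/answer pairs shown above.
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative documentation chunks; in practice these would come from your docs pipeline.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation'

# Mirrors the so.infer(data, system_prompt, output_schema=...) call shown later on this page.
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)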


How Sutro Works

Define an output schema, point so.infer at your data, and Sutro runs the batch job across millions of rows.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data: a CSV of user reviews (the demo cycles through user_reviews.csv, user_reviews-1.csv, ...).
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
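A minimal sketch of consuming the structured results, assuming so.infer returns one record per input row with the sentiment field from ReviewClassifier (the exact return type is not shown on this page):

from collections import Counter

# Tally sentiment labels across the batch; assumes each result behaves like a
# dict with a 'sentiment' key matching the ReviewClassifier schema.
sentiment_counts = Counter(row['sentiment'] for row in results)
print(sentiment_counts)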



Inputs (patent records) → Outputs (generated descriptions):

Improvement in Telegraph, #174,465, Alexander Graham Bell... → This patent describes the first creation of a working telephone system...

Electric Lamp, #223,898, Thomas Edison → This patent pertains to the original invention of the lightbulb by Thomas Edison...

Flying Machine, #821,393, Orville and Wilbur Wright → This patent was issued for the first heavier-than-air aircraft created by...


Progress: 1% | 1/2.5M Rows

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
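A hedged sketch of how a job like this patent demo might be written, mirroring the review-classification snippet above; the PatentSummary schema, the prompt wording, and the inline patents list are illustrative assumptions, not part of a documented Sutro API:

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for the one-sentence patent descriptions shown above.
class PatentSummary(BaseModel):
    summary: str

# Illustrative inputs matching the demo rows (the real job runs over ~2.5M rows).
patents = [
    'Improvement in Telegraph, #174,465, Alexander Graham Bell',
    'Electric Lamp, #223,898, Thomas Edison',
    'Flying Machine, #821,393, Orville and Wilbur Wright',
]

system_prompt = 'Write a one-sentence description of the following patent.'

results = so.infer(patents, system_prompt, output_schema=PatentSummary)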

Sutro works alongside your existing data stack:

Data Orchestrators

Object Storage and Open Data Formats

Notebooks and Pythonic Coding Tools
