Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

System prompt: Generate a question/answer pair for the following chunk of vLLM documentation

Inputs


Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, start by generating...


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM RunAI optional dependency:
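The excerpt cuts off before the install command; in vLLM's documentation this optional dependency is installed as an extra, roughly as follows (the exact extra name may differ between vLLM versions):

pip install vllm[runai]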

Outputs

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
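Expressed with the Sutro SDK, a demo like this is a few lines of Python. A minimal sketch, mirroring the so.infer call shown later on this page; the QAPair schema and the in-memory doc_chunks list are illustrative assumptions, not part of a documented Sutro API:

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for the question/answer pairs shown above.
class QAPair(BaseModel):
    question: str
    answer: str

# Illustrative documentation chunks; in practice these would come from your docs pipeline.
doc_chunks = [
    'vLLM is a fast and easy-to-use library for LLM inference and serving...',
    'vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer...',
]

system_prompt = 'Generate a question/answer pair for the following chunk of vLLM documentation'

# Mirrors the so.infer(data, system_prompt, output_schema=...) call shown later on this page.
results = so.infer(doc_chunks, system_prompt, output_schema=QAPair)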


How Sutro Works

Define an output schema, point so.infer at your data, and Sutro runs the batch job across millions of rows.

import sutro as so
from pydantic import BaseModel

# Structured output schema: one sentiment label per review.
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data: a CSV of user reviews (the demo cycles through user_reviews.csv, user_reviews-1.csv, ...).
user_reviews = 'user_reviews.csv'

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
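A minimal sketch of consuming the structured results, assuming so.infer returns one record per input row with the sentiment field from ReviewClassifier (the exact return type is not shown on this page):

from collections import Counter

# Tally sentiment labels across the batch; assumes each result behaves like a
# dict with a 'sentiment' key matching the ReviewClassifier schema.
sentiment_counts = Counter(row['sentiment'] for row in results)
print(sentiment_counts)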



Inputs (patent records) → Outputs (generated descriptions):

Improvement in Telegraph, #174,465, Alexander Graham Bell... → This patent describes the first creation of a working telephone system...

Electric Lamp, #223,898, Thomas Edison → This patent pertains to the original invention of the lightbulb by Thomas Edison...

Flying Machine, #821,393, Orville and Wilbur Wright → This patent was issued for the first heavier-than-air aircraft created by...


Progress: 1% | 1/2.5M Rows

█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
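A hedged sketch of how a job like this patent demo might be written, mirroring the review-classification snippet above; the PatentSummary schema, the prompt wording, and the inline patents list are illustrative assumptions, not part of a documented Sutro API:

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for the one-sentence patent descriptions shown above.
class PatentSummary(BaseModel):
    summary: str

# Illustrative inputs matching the demo rows (the real job runs over ~2.5M rows).
patents = [
    'Improvement in Telegraph, #174,465, Alexander Graham Bell',
    'Electric Lamp, #223,898, Thomas Edison',
    'Flying Machine, #821,393, Orville and Wilbur Wright',
]

system_prompt = 'Write a one-sentence description of the following patent.'

results = so.infer(patents, system_prompt, output_schema=PatentSummary)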

Sutro works alongside your existing data stack:

Data Orchestrators

Object Storage and Open Data Formats

Notebooks and Pythonic Coding Tools
