Batch LLM Inference is better with Sutro

Run LLM Batch Jobs in Hours, Not Days, at a Fraction of the Cost.

Generate a question/answer pair for the following chunk of vLLM documentation

Inputs

Outputs

Intro to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Loading Models

vLLM models can be loaded in two different ways. To pass a loaded model into the vLLM framework for further processing and inference without reloading it from disk or a model hub, first start by generating…


Using the OpenAI-Compatible Server

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory. Further reading can be found in the Run:ai Model Streamer documentation.

vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. You first need to install the vLLM Run:ai optional dependency:

Question: Is vLLM compatible with all open-source models? ...

Question: How do I load a custom model from HuggingFace? ...

Question: Can I use the OpenAI compatible server to replace calls...

+128 more…
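
For a rough sense of how this demo maps to code, the sketch below reuses the so.infer call pattern from the resume-screening example further down this page. The QAPair schema and the vllm_doc_chunks.csv input are illustrative assumptions, not documented Sutro APIs.

import sutro as so
from pydantic import BaseModel

# Hypothetical schema for one generated Q&A pair (illustrative, not a documented type)
class QAPair(BaseModel):
    question: str
    answer: str

doc_chunks = "vllm_doc_chunks.csv"  # assumed file with one documentation chunk per row

system_prompt = "Generate a question/answer pair for the following chunk of vLLM documentation."

qa_pairs = so.infer(doc_chunks, system_prompt, output_schema=QAPair)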

Resume screening

Find top candidates in minutes, not weeks

Process thousands of resumes at once to identify the most qualified applicants. Sutro runs LLM batch jobs in hours, not days, at a fraction of the cost, simplifying your hiring pipeline.

From Application Pile to Candidate Shortlist, Simplified

Sutro takes the pain away from testing and scaling LLM batch jobs. Use our simple Python SDK to transform your unstructured resume data into structured insights.

import sutro as so
from pydantic import BaseModel

# Structured output schema for each classified review
class ReviewClassifier(BaseModel):
    sentiment: str

# Input data file (the demo cycles through user_reviews.csv, user_reviews-1.csv,
# user_reviews-2.csv, and user_reviews-3.csv)
user_reviews = "user_reviews.csv"

system_prompt = 'Classify the review as positive, neutral, or negative.'

results = so.infer(user_reviews, system_prompt, output_schema=ReviewClassifier)

Progress: 1% | 1/514,879 | Input tokens processed: 0.41m, Tokens generated: 591k

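The same pattern adapts to the resume screening described in this section. This is a minimal sketch: the ResumeScreen fields, the resumes.csv input, and the prompt wording are assumptions, not a documented recipe.

import sutro as so
from pydantic import BaseModel

# Hypothetical screening schema (field names are assumptions, not part of the SDK)
class ResumeScreen(BaseModel):
    candidate_name: str
    meets_requirements: bool
    summary: str

resumes = "resumes.csv"  # assumed export of resume text, one record per row

system_prompt = (
    "Screen the following resume against the role requirements. "
    "Return the candidate name, whether they meet the requirements, and a one-sentence summary."
)

shortlist = so.infer(resumes, system_prompt, output_schema=ResumeScreen)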

Prototype Your Screening Criteria

Start small and iterate fast. Accelerate experiments by testing your screening and extraction criteria on a small batch of resumes before committing to the full applicant pool.

Scale Your Screening

Scale your LLM workflows to process your entire applicant pipeline. Process billions of tokens in hours, not days, with no infrastructure headaches or exploding costs.
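
As a sketch of the prototype-then-scale loop described above: the example assumes so.infer accepts a list of strings and uses an illustrative resumes.csv; neither is confirmed by Sutro's documentation.

import sutro as so
import pandas as pd

system_prompt = "Screen the following resume and label it strong_fit, possible_fit, or no_fit."

# Hypothetical input: one resume's text per row in resumes.csv
resumes = pd.read_csv("resumes.csv")["resume_text"].tolist()

# Prototype: test the screening criteria on a small sample first
preview = so.infer(resumes[:100], system_prompt)

# Scale: once the criteria look right, submit the full applicant pool as one batch job
results = so.infer(resumes, system_prompt)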

Integrate With Your ATS

Seamlessly connect Sutro to your existing hiring workflows. Sutro's Python SDK is compatible with popular data orchestration tools, like Airflow and Dagster.
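
For orchestration, a Sutro batch job can be wrapped in an ordinary pipeline step. The Dagster asset below is a hedged sketch: the ATS export file, column names, and result handling are assumptions rather than Sutro- or Dagster-specific requirements.

import pandas as pd
import sutro as so
from dagster import asset

@asset
def screened_candidates() -> pd.DataFrame:
    # Pull resume text exported from the ATS (hypothetical file and column names)
    resumes = pd.read_csv("ats_export.csv")["resume_text"].tolist()
    system_prompt = "Screen the following resume and label it strong_fit, possible_fit, or no_fit."
    # Run the screening batch through Sutro, then hand structured results downstream
    results = so.infer(resumes, system_prompt)
    return pd.DataFrame(results)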

Scale your screening effortlessly

Confidently handle applicant pools of any size. Process millions of resumes and billions of tokens at a time without the pain of managing infrastructure.

Reduce recruiting costs by 10x

Get results faster and reduce costs by parallelizing LLM calls through Sutro. Stop paying for expensive, single-request processing and start saving with efficient batch jobs.

Shorten your time-to-hire

Shorten hiring cycles by getting feedback from large batches of resumes in minutes. Go from a mountain of applications to a qualified shortlist in hours.

Job Description Parsing

Parse job descriptions into structured requirements, skills, and qualifications to match candidates against open roles.

Contact Info Extraction

Automatically pull key contact details from resumes and other documents to populate your systems.

Performance Review Summarization

Distill long performance reviews into concise summaries to quickly identify key themes and feedback.

Structured Extraction

Turn unstructured data from any document into structured insights that drive business decisions.

Document Tagging

Enrich your data with meaningful labels to improve data preparation and organization.

Synthetic Data Generation

Generate high-quality, representative data to improve model performance or fill statistical gaps.

FAQ

What is Sutro?

What can I do with Sutro?

How does Sutro save money?

Can I use Sutro with my existing tools?

How does Sutro handle large-scale jobs?

What Will You Scale with Sutro?