(No) Need For Speed: Why Batch LLM Inference is Often the Smarter Choice

Sutro Team

Jun 15, 2025

Introduction

If your LLM task does not require an immediate, user-facing response, there is a good chance you are paying too much for inference and working with a more cumbersome workflow than you need.

Most teams reach for a synchronous API from providers like OpenAI or Anthropic when building applications with LLMs. Synchronous APIs let AI teams get responses back from LLMs within seconds. While some tasks – like customer support – are latency sensitive, most are not. Tasks like classification, data enrichment, and document processing usually happen in bulk and are not as time sensitive as chat applications.

Processing less latency-sensitive tasks in bulk – also known as batch inference – has several benefits. Some benefits, like cost savings due to more efficient resource management, are obvious. Others – like speed, code simplicity, and rate-limit management – are not.

In this article, we argue that many common LLM tasks should use a batch inference service instead of a synchronous API. Then, we compare the workflows for doing bulk tasks with synchronous versus batch APIs to demonstrate the hidden benefits of batch inference. Finally, we cover the main benefits of batch inference, which include significant cost savings, hidden time savings, and a more developer-friendly workflow.

Synchronous vs. Batch APIs: How To Choose

Choosing between a real-time and a batch API is similar to the classic database decision between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). That decision has become fairly obvious to most technical buyers over the last decade, but equivalent standards and offerings have not yet caught up for LLM inference.

OLTP systems handle real-time, high-volume transactions, like those in online banking or e-commerce. OLAP systems, on the other hand, are designed for analyzing large datasets to extract insights and support decision-making.

One is for immediate, interactive requests, while the other is for processing large volumes of data efficiently in the background.

  • Transactional Queries (OLTP): These are synchronous. Your application sends a request and expects a nearly instant response. This is essential for interactive applications like SaaS products and social media.

  • Analytical Queries (OLAP): These are often asynchronous or longer-running jobs. You submit a query and retrieve the results later, after minutes or hours depending on the size of the job. These systems are designed for throughput and cost-efficiency, not low latency.

Because analytical services run asynchronously, they offer significant cost savings and/or larger-scale processing capabilities with the tradeoff of latency.

Similarly, for LLMs, here’s how to decide when to reach for a synchronous API or batch API:

| Feature | Synchronous API | Batch API (Asynchronous) |
| --- | --- | --- |
| Analogy | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| Latency | Low (milliseconds to seconds) | High (minutes to hours) |
| Cost | Higher per request/token | Significantly lower per request/token (50-90%+ discount) |
| Workload | Interactive, unpredictable, single requests | Large-volume, predictable, bulk offline tasks |
| Interaction | Request → Wait → Immediate Response | Submit Job → Poll/Callback → Results Later |
| Common Adjacent Tools | Web frameworks (Flask, FastAPI), Postgres, chatbot platforms | Data orchestrators (Airflow, Dagster, Prefect), object storage, workflow engines |

If your user needs an answer now, use a real-time API. For almost everything else, a Batch API is the simpler, more economical choice.

What Can I Actually Do with Batch Inference?

Many valuable LLM applications become far more feasible and scalable with batch processing. They typically fall into one of two categories: enterprise-oriented operational tasks, and research-oriented discovery tasks.

Examples of operational tasks might include:

  • Document Summarization & Analysis: Process entire archives of reports, legal documents, or research papers.

  • Data Classification, Extraction, and Enrichment: Tag, categorize, and extract structured information from large, unstructured text datasets (e.g., enriching product catalogs or structuring web page content).

  • Retrieval-Augmented Generation (RAG) Data Preparation: Efficiently generate embeddings for document corpora to power your RAG systems.

  • Content Moderation: Review user-generated content for policy violations.

  • Product Catalog Enhancements: Improve and align entire catalogs of product descriptions and images, all at once.

  • Content Transcription, Translation, Localization, and Personalization: Transform marketing content with AI models to improve conversion rates.

Research-oriented tasks might include:

  • (Synthetic) Data Generation: Create large, diverse datasets for training custom models or improving RAG system performance.

  • Offline Model Evaluation: Run a suite of tests against a model to measure its capabilities against a specific set of criteria.

  • Simulations: Run an AI model across a large domain of inputs for scientific discovery in fields ranging from social science, to economics, to drug discovery, and more.

  • Sentiment Analysis: Analyze sentiment across customer feedback, social media mentions, or survey responses.

Note that the size of the task doesn’t matter much when deciding whether to use a batch API. Even a job of a few dozen inputs that share the same prompt is better architected around a batch API.
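A classification or sentiment-analysis pass, for example, reduces to building one prompt per record and submitting the whole list as a single job. Here is a minimal sketch of that shape; the prompt template, field names, and records are illustrative:

# Hypothetical example: turn raw records into one prompt per record for a batch job.
# The template, records, and downstream batch submission are illustrative placeholders.
PROMPT_TEMPLATE = (
    "Classify the sentiment of the following customer review as "
    "positive, neutral, or negative. Reply with a single word.\n\nReview: {review}"
)

records = [
    {"id": 1, "review": "The checkout flow was fast and painless."},
    {"id": 2, "review": "Support never answered my ticket."},
    {"id": 3, "review": "It works, but setup took longer than expected."},
]

# One prompt per record; the whole list becomes a single batch submission.
prompts = [PROMPT_TEMPLATE.format(review=r["review"]) for r in records]

# `prompts` would then be handed to your batch provider of choice in one job,
# rather than looped over with individual real-time API calls.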

Why Real-Time Isn’t The Right Form Factor For Batch Jobs

Our company, Sutro, is a batch AI inference provider. However, we have been surprised to find that many customers we’ve worked with and spoken to are still doing batch tasks with synchronous LLM inference APIs.

So why bother switching to batch when you’ve already architected your batch jobs around synchronous APIs?

The answer: simplicity, cost savings, and time savings.

We often see customers building brittle, custom-made tooling to accomplish batch inference tasks. This can be as simple as a single for loop or as complicated as a custom-built load balancer with a pool of rotating keys, exponential backoffs/retries, caching mechanisms, and task queues.

We’ve spoken with customers who let such pipelines run for weeks at a time. Plus, these pipelines can take hundreds of engineering hours to build, cost significantly more to operate, and riddle a company’s codebase with technical debt.

Below, we give you a sense of what you would need to implement in Python to roll your own batch pipeline with synchronous APIs:

import random
import time
from collections import deque
from typing import List

import requests


class BrittleBatchInference:
    def __init__(self):
        # Pool of rotating API keys to dodge per-key rate limits
        self.api_keys = deque(['key1', 'key2', 'key3'])
        self.cache, self.failed_items = {}, []

    def process_item(self, item):
        if item in self.cache:
            return self.cache[item]  # Cache hit

        for attempt in range(5):
            try:
                self.api_keys.rotate(1)  # Rotate keys for load balancing
                response = requests.post(
                    'https://api.example.com/inference',
                    headers={'Authorization': f'Bearer {self.api_keys[0]}'},
                    json={'data': item},
                )

                if response.status_code == 200:
                    result = response.json()
                    self.cache[item] = result
                    return result
                elif response.status_code == 429:
                    # Rate limited - exponential backoff with jitter
                    time.sleep(min(300, (2 ** attempt) + random.uniform(0, 1)))

            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff on error

        self.failed_items.append(item)  # Add to retry queue
        return None

    def run_batch(self, items: List[str]):
        # One request at a time - slow, and one bad item can stall everything
        results = [r for item in items if (r := self.process_item(item)) is not None]

        # Infinite retry loop - could run for weeks!
        while self.failed_items:
            print(f"Retrying {len(self.failed_items)} failed items...")
            retry_batch, self.failed_items = self.failed_items.copy(), []
            results.extend(
                [r for item in retry_batch if (r := self.process_item(item)) is not None]
            )
            time.sleep(60)  # Wait before next retry cycle

        return results


# Usage - runs your inputs through all the brittle complexity above
my_input_list = ["input 1", "input 2", "input 3"]  # stand-in for your real inputs
pipeline = BrittleBatchInference()
results = pipeline.run_batch(my_input_list)

Even with all the hacks in place, a custom batch pipeline built around synchronous LLM API calls runs into many issues:

  • Speed: you have to process one request at a time, which is very slow. Parallelizing it yourself means even more rate-limit-aware code (see the sketch after this list).

  • Errors: one error can break the entire loop. The client needs to be coded to handle retries.

  • Rate Limits: it is too easy to hit rate limits, so multiple API keys and self-imposed limits on API calls must be implemented.

  • Scale: hitting rate limits means there is a ceiling on how high you can scale your application, both for speed and volume.

  • Costs: while parallelization is possible to some extent, you are still paying a real-time rate for a batch job, which limits how much you can do with LLMs without breaking the bank.
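Even partial parallelization doesn't remove this complexity, it only moves it: you need an async client, a concurrency cap to stay under rate limits, and the same retry logic as before. Here is a rough sketch of just the concurrency piece, using asyncio and httpx against a placeholder endpoint and payload:

import asyncio

import httpx

API_URL = "https://api.example.com/inference"  # placeholder endpoint
MAX_CONCURRENCY = 8  # self-imposed cap to stay under the provider's rate limits


async def call_one(client: httpx.AsyncClient, semaphore: asyncio.Semaphore, item: str):
    # The semaphore bounds how many requests are in flight at once.
    async with semaphore:
        response = await client.post(
            API_URL,
            headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
            json={"data": item},
        )
        response.raise_for_status()  # retries/backoff would still be needed on top of this
        return response.json()


async def run(items: list[str]):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient(timeout=60) as client:
        return await asyncio.gather(*(call_one(client, semaphore, i) for i in items))


# results = asyncio.run(run(my_input_list))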

You can circumvent all these issues by using a batch API endpoint. For example, here’s how you would run the same batch pipeline above using Sutro:

import sutro as so

job_id = so.infer(
    inputs=my_input_list,
    job_priority=1
)

results = so.await_job_results(job_id, timeout=3600)

Unlike the brittle batch pipeline, there is no rate-limit handling, no retries, and no looping through inputs. That’s because batch APIs do all of this behind the scenes. Moreover, batch APIs from any provider can save you 50% to 90%+ and complete jobs 10x faster or more by efficiently parallelizing inference.

In addition to this simplicity, batch providers often deliver your results in a format appropriate for reviewing outputs all at once (e.g. CSV, XLSX, JSONL). Some, like Sutro, even provide interfaces to directly invoke, observe, and review jobs.
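If your results come back as JSONL, for example, reviewing them in bulk takes a couple of lines with standard data tooling; a minimal sketch with pandas, where the file name is a placeholder:

import pandas as pd

# Load a JSONL results file (one JSON object per line) into a DataFrame for review.
results_df = pd.read_json("batch_results.jsonl", lines=True)
print(results_df.head())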

Behind The Scenes: Why Batch Inference Saves Costs And Time

Batch API providers don’t just parallelize synchronous requests better: they’re architected in a fundamentally different way. It’s because of this that batch APIs can save costs and avoid rate limits.

To keep latency low, synchronous API providers need to maintain a fleet of “always-on” GPUs. Batch APIs, on the other hand, are not subject to the same constraint. Instead, batch providers can take advantage of spare GPU capacity and autoscale hardware based on the jobs in their queues. Using spare GPUs and scaling only when needed means batch providers can get lower prices on hardware.

Additionally, batch providers can reduce costs even further and avoid rate limits by optimizing for throughput. Synchronous APIs must process requests as they come in to keep latency low. Batch APIs, on the other hand, can afford to wait and schedule work to maximize GPU utilization, ensuring no resources go to waste.
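To put rough numbers on the savings, here is a back-of-the-envelope comparison assuming hypothetical per-token prices and a flat 50% batch discount; the rates and token counts are illustrative, not any provider's actual pricing:

# Hypothetical pricing: $0.50 per million input tokens, $1.50 per million output tokens.
SYNC_INPUT_RATE = 0.50 / 1_000_000
SYNC_OUTPUT_RATE = 1.50 / 1_000_000
BATCH_DISCOUNT = 0.50  # many batch APIs advertise ~50% off real-time rates

input_tokens = 200_000_000   # e.g. 1M documents x ~200 input tokens each
output_tokens = 50_000_000   # e.g. 1M documents x ~50 output tokens each

sync_cost = input_tokens * SYNC_INPUT_RATE + output_tokens * SYNC_OUTPUT_RATE
batch_cost = sync_cost * (1 - BATCH_DISCOUNT)

print(f"Real-time cost: ${sync_cost:,.2f}")   # $175.00
print(f"Batch cost:     ${batch_cost:,.2f}")  # $87.50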

Who's Who in the Batch API World?

Several players offer batch processing, but the landscape is divided between proprietary and open-source models.

1. Major Proprietary Model Providers

OpenAI, Anthropic, and cloud providers like Amazon Bedrock, Azure ML, and Google Vertex AI all offer batch APIs for their proprietary models. They typically provide a 50% discount compared to their real-time rates. You upload a file of requests, and results are generally available within 24 hours. Smaller jobs can run in significantly less time.
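As a concrete example of this file-based flow, here is a minimal sketch using the OpenAI Python SDK's Batch API; the model choice, file contents, and polling cadence are illustrative:

import json
import time

from openai import OpenAI

client = OpenAI()

# 1. Write one request per line to a JSONL file.
request_lines = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # illustrative model choice
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(["First input", "Second input"])
]
with open("batch_requests.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in request_lines))

# 2. Upload the file and create the batch job (results due within 24 hours).
input_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job finishes, then download the output file (JSONL of results).
while (batch := client.batches.retrieve(batch.id)).status not in ("completed", "failed", "expired"):
    time.sleep(60)
if batch.status == "completed":
    output_jsonl = client.files.content(batch.output_file_id).text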

Other data platforms like Snowflake and Databricks offer such functionality, but often at a steep cost premium or very low quota limits.

2. Specialized Batch Platforms for Open Source

While proprietary models are powerful, open-source models often deliver superior cost-to-performance for the "workhorse" tasks that are perfect for batch processing. Nonetheless, most open-source model APIs, like Together.ai, do not provide batch offerings.

This is where Sutro comes in.

Sutro is a platform built specifically to make batch inference with open-source LLMs easy and incredibly cheap. It bridges the gap by offering:

  • Massive Cost Savings (50-90%+): By leveraging optimized open-source “workhorse” foundation models like Llama 3.3, Gemma 3, and Qwen3, Sutro can achieve dramatic cost reductions compared to using closed-source real-time APIs.

  • Custom Model Hosting: Sutro also offers the ability to host custom models, allowing teams with highly specific needs to easily run their offline inference tooling more efficiently.

  • No Infrastructure Headaches: Sutro handles all the complexity of scaling batch jobs for open-source models, so your team doesn't have to.

  • The Right Form Factor: Sutro is more than a simple API. Sutro provides customers with interfaces to invoke, observe, and review batch job runs at a glance, something you can’t do when running a “for loop” over all your inputs with a real-time API.

  • Seamless Integration: Sutro's Python SDK is designed to work with the data orchestration tools you already use, like Airflow, Prefect, and Dagster (see the sketch after this list).

  • Better Security & No Lock-In: Because Sutro doesn’t train foundation models, we don't train on your data or retain it for longer than you need. We also offer tiers of zero-data-visibility deployments for enterprise customers. Plus, using open-source models means you aren't tied to a single provider (including Sutro!).
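As an example of that integration, a batch job like the one shown earlier can be wrapped in a Prefect flow and scheduled alongside the rest of your pipeline. This is a sketch that reuses the sutro calls from the example above; the save_results step is a hypothetical downstream task:

from prefect import flow, task

import sutro as so


@task
def run_sutro_batch(inputs: list[str]):
    # Same calls as the earlier example: submit the job, then wait for results.
    job_id = so.infer(inputs=inputs, job_priority=1)
    return so.await_job_results(job_id, timeout=3600)


@task
def save_results(results):
    # Hypothetical downstream step: persist results wherever your pipeline expects them.
    with open("results.jsonl", "w") as f:
        for row in results:
            f.write(f"{row}\n")


@flow
def nightly_enrichment(inputs: list[str]):
    results = run_sutro_batch(inputs)
    save_results(results)


# nightly_enrichment(my_input_list)  # schedule with Prefect like any other flow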

Start Building with Batch in Mind

Batch APIs aren't a niche feature; they are a fundamental tool for any organization serious about using LLMs at scale. For any workload that doesn't need an instant response, batch processing offers a path to get it done without the premium price tag, rate limits, or brittle code.

Whether you're architecting a new project or re-evaluating an existing one, ask yourself: "Do I really need this in real-time?" If the answer is no, you should consider using a batch API.

Specialized platforms like Sutro have made using open-source models for batch jobs more accessible and cost-effective than ever. By embracing asynchronous processing, you can scale and power your most ambitious AI projects in ways real-time APIs aren’t built for.

If you’re building any batch tasks with LLMs, feel free to reach out to see how Sutro can help.

Request early access to get started with Sutro