Workhorse LLMs: Why Open Source Models Dominate Closed Source for Batch Tasks

Sutro Team

Jun 6, 2025

As LLMs become more prevalent, we've noticed that teams still reach for closed source models like GPT, Claude, and Gemini for nearly every task. While that may have been the right call a year ago, teams today are unknowingly missing out on huge cost savings and performance gains by not considering open source alternatives.

It is true that at the frontier of intelligence, the most powerful closed source LLMs dominate their open source counterparts. However, many common LLM tasks don't require PhD-level reasoning. Instead, they require a workhorse LLM: something that's reliable for low-to-medium difficulty tasks – things like classification, summarization, and data extraction.

Not only do suitable replacements exist for closed source workhorses like GPT-4o-mini, but those replacements are often both less expensive and more intelligent. When latency isn't an issue, open source models gain an even larger cost edge when jobs are run in bulk through a batch inference provider like Sutro.

In this guide, we compare the performance and cost of workhorse models. After the analysis, we provide a handy conversion chart that helps you pick the best open source replacement for the closed source models you already use, along with the cost savings you should expect from making the switch.

How Do Open Source LLMs Stack Up Against Closed Source LLMs?

A common question organizations ask us is how the cost and performance of open source models stack up against closed source models. To answer this question, it’s best to divide the field into two categories: frontier models and workhorse models.

Frontier models are the biggest, most capable models. They promise emergent capabilities, generalization across tasks, and the ability to handle complex context and instructions.

At the frontier, closed source models like Claude Opus 4.0, OpenAI’s o3 model, and Gemini 2.5 Pro dominate. However, open source models like Qwen 3 235B-A22B and DeepSeek R1 (and the to-be-released R2) are closing the gap quickly.

The closed providers also offer smaller, less expensive models, including GPT-4o mini, Gemini 2.5 Flash, and Claude Haiku 3.5. Many companies adopt these models to save on costs when running tasks that require less general reasoning capability. We refer to these as "workhorse" models, and we believe that for this class of model, open source dominates.

Workhorse models are extremely adept at most common business tasks (a brief code sketch follows the list), including:

  • Structured Extraction: Turning free-form text or images into JSON and analytics-ready datasets.

  • Summarizing Text: Turning large documents or conversations into bullet points or summaries.

  • Answering Questions: Doing Q&A over your documents when complex reasoning isn't required.

  • Sentiment Analysis: Determining how positive, negative, or neutral a given text is.

  • Generating Synthetic Data: Creating datasets from scratch to power fine-tuning or improve RAG systems.

  • Data Extraction And Organization: Finding and extracting particular pieces of information from unstructured text.

  • Classification: Sorting text into categories specific to your applications.

  • Running Evals: Evaluating the robustness of your prompts using techniques like LLM-as-a-judge and metric scoring.
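
To make the first of these tasks concrete, here is a minimal sketch of structured extraction with an open workhorse model. It assumes an OpenAI-compatible endpoint (for example, a local vLLM server); the base URL, API key, and model identifier below are placeholders rather than any specific provider's values.

```python
# Minimal sketch: turning free-form text into JSON with an open workhorse model.
# Assumes an OpenAI-compatible endpoint (e.g., a local vLLM server); the
# base_url, api_key, and model identifier are placeholders, not real values.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM_PROMPT = (
    "Extract the customer name, product, and issue from the user's text. "
    'Respond with only a JSON object with keys "name", "product", and "issue".'
)

def extract(text: str) -> dict:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-14B",  # placeholder model identifier
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    # A production version should validate the output and handle malformed JSON.
    return json.loads(response.choices[0].message.content)

print(extract("Dana reported that her Acme X200 router drops Wi-Fi every hour."))
```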

Most organizations already use models like GPT-4o-mini and Gemini 2.5 Flash for these tasks to speed up inference and keep costs low. As we demonstrate below, equivalent open source LLMs can deliver the same or better performance, often with huge cost savings, especially when using a batch API like Sutro.

How Open Source LLMs Stack Up On Common Business Tasks

Figuring out how well an LLM performs is tricky. Most model creators game benchmarks by training toward the test and evaluating with carefully crafted prompts. Nonetheless, because model providers all game the system in similar ways, benchmarks still give us a heuristic for relative capability.

To determine which open source workhorse models are likely good replacements, in both performance and cost, for the closed source models you have probably used, we will use Artificial Analysis's Intelligence Index. The Intelligence Index is a weighted average of benchmarks across many tasks, so no single benchmark is overweighted, which gives us the heuristic we are looking for.

Additionally, we pulled together the average cost per token for common workhorse tasks. We used a 10:1 blended input-to-output token ratio for non-thinking models, calculated from our internal metrics. For models with a thinking mode, we budgeted 2k thinking tokens, which shifts the blended ratio to 3:2.

Finally, we calculated a performance-to-cost ratio – a "bang for buck" measurement – by dividing each model's Intelligence Index score by its blended cost per million tokens, giving us a unified metric for comparing models.
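
For readers who want to reproduce the math, the sketch below implements the blended-cost and performance-to-cost calculations just described. The per-token prices are illustrative assumptions (approximately Gemini 2.5 Flash's list prices at the time of writing), so expect small rounding differences from the table.

```python
# Sketch of the blended-cost and performance-to-cost arithmetic described above.
# Prices are illustrative assumptions; small rounding differences from the
# table in this post are expected.

def blended_cost(input_price: float, output_price: float, thinking: bool) -> float:
    """Blended $/M tokens: 10:1 input:output normally, 3:2 with thinking."""
    in_w, out_w = (3, 2) if thinking else (10, 1)
    return (in_w * input_price + out_w * output_price) / (in_w + out_w)

# Gemini 2.5 Flash without thinking: ~$0.15/M input, ~$0.60/M output, index 53.
cost = blended_cost(0.15, 0.60, thinking=False)
print(f"${cost:.3f}/M, perf-to-cost = {53 / cost:.0f}")  # $0.191/M, ~278

# Gemini 2.5 Flash with thinking: ~$0.15/M input, ~$3.50/M output, index 65.
cost = blended_cost(0.15, 3.50, thinking=True)
print(f"${cost:.3f}/M, perf-to-cost = {65 / cost:.0f}")  # $1.490/M, ~44
```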

LLM Benchmarks – Performance And Cost Comparison

Here, we provide benchmark comparisons, sorted by Intelligence Index, for common LLMs businesses should consider for workhorse tasks, along with the average cost per million tokens (real-time API pricing, plus batch pricing through Sutro for select models we offer).

| Model | Type | Artificial Analysis Intelligence Index | Avg. Cost Per Million Tokens (Real-Time API) | Avg. Cost Per Million Tokens (Batch API) | Performance-to-Cost Ratio (Real-Time) | Performance-to-Cost Ratio (Batch) |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | Closed | Thinking: 65/100; No Thinking: 53/100 | Thinking: $1.49; No Thinking: $0.191 | Thinking: $0.745; No Thinking: $0.096 | Thinking: 44; No Thinking: 277 | Thinking: 87; No Thinking: 552 |
| GPT-4o-mini | Closed | 36/100 | $0.764 | $0.382 | 47 | 94 |
| GPT-4.1-mini | Closed | 53/100 | $0.509 | $0.255 | 104 | 207 |
| Claude 3.5 Haiku | Closed | 38/100 | $1.273 | $0.637 | 30 | 60 |
| Qwen3 32B | Open | Thinking: 59/100; No Thinking: 44/100 | $0.80 | Thinking: $0.648; No Thinking: $0.161 | Thinking: 74; No Thinking: 55 | Thinking: 91; No Thinking: 273 |
| Qwen3 14B | Open | Thinking: 55/100; No Thinking: 41/100 | $0.30 | Thinking: $0.185; No Thinking: $0.069 | Thinking: 183; No Thinking: 140 | Thinking: 297; No Thinking: 594 |
| Qwen3 4B | Open | Thinking: 47/100; No Thinking: 35/100 | $0.10 | Thinking: $0.158; No Thinking: $0.035 | Thinking: 470; No Thinking: 350 | Thinking: 297; No Thinking: 1000 |
| Llama 3.3 70B | Open | 41/100 | $0.54 | $0.302 | 76 | 136 |
| Llama 3.1 8B | Open | 24/100 | $0.10 | $0.035 | 240 | 686 |
| Gemma3 27B | Open | 38/100 | $0.80 | $0.191 | 48 | 199 |
| Gemma3 4B | Open | 24/100 | $0.10 | $0.028 | 240 | 857 |

Gemini 2.5 Flash is the strongest closed source competitor to the open models. While its thinking mode is often too expensive for workhorse tasks, its non-thinking mode is nearly as performant as the top open source models at similar pricing. Even so, many organizations look for alternatives to proprietary models to avoid vendor lock-in and to gain greater control over security, data privacy, and deployment. Surprisingly, Qwen3 14B (not 32B!) offers slightly better performance at a similar cost profile.

Apart from Gemini 2.5 Flash, both raw performance and the performance-to-cost ratio clearly favor open source models, especially when they are run through batch inference. Generally speaking, open source models offer between 2x and 10x the price-performance of their closed source counterparts.

For example, the next best closed model, GPT-4.1-mini, underperforms Qwen3 14B in thinking mode and is more expensive.

Despite the benchmarks, many organizations still reach for GPT-4o-mini instead of Gemini 2.5 Flash or GPT-4.1-mini. However, open source offers even more options that deliver better performance at a lower cost, such as Gemma3 27B, Llama 3.3 70B, and the smaller Qwen3 series models. For example, Qwen3 4B without thinking not only offers comparable performance to GPT-4o-mini, but it also has a 10x better performance-to-cost ratio.

How To Choose An Open Source Replacement For Closed Source LLMs

Migrating between models always requires some work, such as adjusting prompts and re-testing performance on internal evals (a minimal sketch of such a check follows the chart below). To make the decision easier, we put together the following conversion chart with estimated cost savings.

Choosing a replacement is not always a one-for-one cost comparison against the model with the closest performance. Many teams will accept a slight performance hit, which can often be made up with more careful prompting, if it comes with cost savings. That's why we also include a "Performance Recovery" column: an estimate of how much of the closed model's benchmark performance the suggested replacement retains (values above 100% mean the open model is stronger).

| Closed Source Model | Open Source Equivalent(s) | Performance Recovery | Estimated Cost Savings (Standard API) | Estimated Cost Savings (Batch API) |
|---|---|---|---|---|
| Gemini 2.5 Flash (No Thinking) | Qwen3 14B (Thinking) | >100% | N/A | N/A |
| Gemini 2.5 Flash (No Thinking) | Qwen3 4B (Thinking) | ~90% | 48% | 21% |
| Gemini 2.5 Flash (No Thinking) | Qwen3 14B (No Thinking) | ~80% | N/A | 64% |
| GPT-4.1-mini | Qwen3 14B (Thinking) | >100% | 40% | 27% |
| GPT-4.1-mini | Qwen3 4B (Thinking) | ~90% | 80% | 38% |
| GPT-4.1-mini | Qwen3 14B (No Thinking) | ~80% | 41% | 73% |
| GPT-4o-mini | Qwen3 4B (No Thinking); Gemma3 27B | >100% | 87% | 91% |
| Claude 3.5 Haiku | Qwen3 4B (No Thinking); Gemma3 27B | >100% | 92% | 95% |

With Gemini, the details are nuanced. While Gemini 2.5 Flash wins on a cost-savings basis, Qwen3 14B offers better performance according to the blended benchmarks. At the top of the workhorse class, open source and closed source are neck and neck.

For every other closed model, there is not only a more powerful open source equivalent to reach for, but often significant inference cost savings as well. Those savings let you take on LLM workloads that were previously cost-prohibitive.
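
As noted above, any migration should be validated on your own data before you switch traffic. The sketch below shows one minimal way to run a side-by-side check on a labeled eval set; it assumes both models are reachable through OpenAI-compatible chat endpoints, and the base URL, API key, and model identifiers are placeholders.

```python
# Minimal sketch: side-by-side accuracy check on a labeled internal eval set.
# Assumes both models sit behind OpenAI-compatible chat endpoints; the
# base_url, api_key, and model identifiers are placeholders.
from openai import OpenAI

LABELED = [  # (text, expected label) drawn from your own eval data
    ("Refund took three weeks and support never replied.", "negative"),
    ("Setup was quick and the docs were clear.", "positive"),
]

SYSTEM_PROMPT = "Reply with exactly one word: positive, negative, or neutral."

def accuracy(client: OpenAI, model: str) -> float:
    hits = 0
    for text, expected in LABELED:
        reply = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
            ],
        ).choices[0].message.content
        hits += reply.strip().lower() == expected
    return hits / len(LABELED)

closed = OpenAI()  # reads OPENAI_API_KEY from the environment
open_src = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")
print("gpt-4o-mini:", accuracy(closed, "gpt-4o-mini"))
print("qwen3-4b:   ", accuracy(open_src, "Qwen/Qwen3-4B"))
```

Exact-match accuracy works for classification; for summarization or extraction, swap in a field-level comparison or an LLM-as-a-judge metric.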

Conclusion

The debate between open source and closed source LLMs often focuses on frontier models, but in doing so it misses the more practical question: which models deliver the best cost-to-performance ratio for the tasks teams actually need done.

For tasks like classification, summarization, and data extraction, businesses already rely on such workhorse models, but they often steer toward the closed providers, unaware of the benefits of open source. In this part of the price-performance space, open source heavily dominates.

By using real-time APIs, businesses can already find open source models with the same or better performance and cost savings of 33% or more. And because many of these tasks are done in bulk – from dozens to millions of requests – and are less latency sensitive, teams can save significantly more, sometimes upwards of 90%, by switching to a batch API like Sutro.

If your team is looking to do more with LLMs and needs help figuring out how to maximize performance while minimizing cost, feel free to reach out to get a free consultation on how to best leverage the latest open source models.

Request early access to get started with Sutro