Cheapest LLM that's good enough.

We run more than 150,000 large-language-model API calls per week and have a panel of leading models evaluate the outputs. Cost per token doesn't track quality. When you ask for *"the model that's reliably 95% as good as the best one but most cost-effective"*, the cheapest qualifier typically saves more than 90% of the cost.

This benchmark publishes our results so other builders can answer the same question for their own pipelines: which models are good enough, and which of those is the most cost-efficient, for each category of tasks. The data also shows that the most expensive model is not always the best performer, even when your budget can pay for it.

Have a look. Tell us what you think. We can also run tests for you — reach us at llmbench@kapualabs.com.

Snapshot: 2026-05-21

Cost mode: Right-sized per step

Cheapest qualifying model per capability, at your quality bar. Greyed cells mean no model qualifies on that category yet.

| Model | Provider | Financial Analysis & Trading Decisions | Structured Data & Fact Extraction | Content Summarization & Synthesis | Long-form Content Generation | Social & Promotional Content | Relevance, Classification & Matching | Topic Organization & Clustering | Infrastructure & Utility | Qualifies on |
|---|---|---|---|---|---|---|---|---|---|---|
| Kimi K2.6 | Moonshot AI | 2/5 · 58% cheaper | 2/5 | 2/4 | 3/6 · 6% cheaper | 3/4 | 4/8 | 2/5 · 64% cheaper | 5/9 · 65% cheaper | 23/35 · cheapest on 6 |
| GPT-5.5 | OpenAI | 3/5 | 1/5 | | 3/6 | 2/4 | 6/9 | 2/5 | 3/9 | 20/35 · cheapest on 2 |
| DeepSeek V4 Pro | DeepSeek | | 3/5 | | 3/6 · 3% cheaper | 2/4 | 4/9 | 1/5 · 39% cheaper | 2/7 · 56% cheaper | 15/35 · cheapest on 1 |
| Claude Opus 4.7 | Anthropic | 2/5 | 1/5 | | 4/6 | 2/4 | 1/10 · 2% cheaper | 1/4 | 3/6 | 14/35 · cheapest on 1 |
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 1/5 · 77% cheaper | 1/4 | | 3/6 · 84% cheaper | 2/3 | 1/11 · 94% cheaper | 2/5 · 87% cheaper | 4/9 · 80% cheaper | 14/35 · cheapest on 7 |
| Gemini 3.1 Pro Preview | Gemini | | 1/5 | 1/4 | | 1/4 · 56% cheaper | 6/12 | 1/5 · 60% cheaper | 2/9 · 60% cheaper | 12/35 · cheapest on 1 |
| Claude Sonnet 4.6 | Anthropic | 2/5 | 1/5 | | 3/6 | 2/4 | 2/8 | 1/5 · 42% cheaper | 1/8 | 12/35 |
| DeepSeek V4 Flash | DeepSeek | | 2/5 · 78% cheaper | | 1/6 · 99% cheaper | | 5/9 · 84% cheaper | 1/5 · 95% cheaper | 1/9 · 97% cheaper | 10/35 · cheapest on 5 |
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | | 1/4 · 88% cheaper | | | 2/3 · 99% cheaper | 2/9 · 99% cheaper | 1/5 · 99% cheaper | 3/9 · 98% cheaper | 9/35 · cheapest on 9 |
| Gemini 3 Flash Preview | Gemini | | | | | | 2/9 · 90% cheaper | | 3/9 | 5/35 · cheapest on 1 |
| Haiku 4.5 | Anthropic | | | | 1/6 · 80% cheaper | | 1/11 | 1/4 · 81% cheaper | | 3/35 |
| MiniMax M2.5 | MiniMax | | 1/5 · 30% cheaper | | | | 2/11 | | | 3/35 |
| GPT-5.4 nano | OpenAI | | 1/5 · 37% cheaper | 1/4 | | | 1/8 · 59% cheaper | | | 3/35 · cheapest on 2 |
| Gemini 3.1 Flash Lite | Gemini | | | | | | 2/12 · 72% cheaper | | | 2/35 |
| GPT-5.4 mini | OpenAI | | | | | 1/4 · 84% cheaper | 1/11 | | | 2/35 |
Cost position within each category is shaded from cheapest qualifier to most expensive qualifier. Each cell shows qualifying tasks · average savings vs the best-performing model, plus a marker for the cheapest qualifier in the category. Greyed cells: model doesn't qualify on any task in that category at the current bar.
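The cell contents described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the record layout, the `summarize` function, the sample numbers, and the reading of "average savings" as `1 - cost / cost_of_best_performer` averaged over qualifying tasks are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-task results: (model, category, task, quality 0-10, cost in USD).
RESULTS = [
    ("big",   "extraction", "t1", 9.0, 0.0100),
    ("mid",   "extraction", "t1", 8.8, 0.0020),
    ("small", "extraction", "t1", 7.0, 0.0002),
    ("big",   "extraction", "t2", 8.0, 0.0100),
    ("mid",   "extraction", "t2", 8.2, 0.0030),
    ("small", "extraction", "t2", 7.9, 0.0003),
]

def summarize(results, bar=0.95):
    """Return the cheapest qualifier per task and per-(model, category) cell stats."""
    by_task = defaultdict(list)
    for model, category, task, quality, cost in results:
        by_task[(category, task)].append((model, quality, cost))

    cheapest = {}  # (category, task) -> name of the cheapest qualifying model
    cells = defaultdict(lambda: {"qualified": 0, "tested": 0, "savings": []})
    for (category, task), rows in by_task.items():
        # The best performer on this task sets the quality bar and the cost baseline.
        _, best_quality, best_cost = max(rows, key=lambda r: r[1])
        threshold = bar * best_quality
        qualifiers = [r for r in rows if r[1] >= threshold]
        cheapest[(category, task)] = min(qualifiers, key=lambda r: r[2])[0]
        for model, quality, cost in rows:
            cell = cells[(model, category)]
            cell["tested"] += 1
            if quality >= threshold:
                cell["qualified"] += 1
                # Assumed definition of savings: relative to the best performer's cost.
                cell["savings"].append(1 - cost / best_cost)
    return cheapest, cells

cheapest, cells = summarize(RESULTS)
for (model, category), cell in sorted(cells.items()):
    avg = sum(cell["savings"]) / len(cell["savings"]) if cell["savings"] else None
    print(model, category, f'{cell["qualified"]}/{cell["tested"]}', avg)
```

With this sample data, `mid` is the cheapest qualifier on `t1` and `small` on `t2`, so `small` would show as 1/2 with 90% savings on its one qualifying task.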

By capability

Financial Analysis & Trading Decisions

Read business/financial docs deeply enough to make forward-looking judgments about opportunity and risk.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.6 Plus | 1/5 | $0.01472 | 77% |
| Kimi K2.6 | 2/5 | $0.05434 | 58% |
| Claude Sonnet 4.6 | 2/5 | $0.10179 | |
| GPT-5.5 | 3/5 | $0.15379 | |
| Claude Opus 4.7 | 2/5 | $0.16965 | |

Full category breakdown →

Structured Data & Fact Extraction

Precise pattern recognition and field-level information retrieval, strict schema adherence.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash | 1/4 | $0.00037 | 88% |
| DeepSeek V4 Flash | 2/5 | $0.00062 | 78% |
| GPT-5.4 nano | 1/5 | $0.00194 | 37% |
| MiniMax M2.5 | 1/5 | $0.00215 | 30% |
| Qwen 3.6 Plus | 1/4 | $0.00307 | |

See all 11 qualifying models →

Content Summarization & Synthesis

Compress long input into essential nuggets without losing material detail.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| GPT-5.4 nano | 1/4 | $0.00205 | |
| Kimi K2.6 | 2/4 | $0.01223 | |
| Gemini 3.1 Pro Preview | 1/4 | $0.01386 | |

Full category breakdown →

Long-form Content Generation

Sustained compositional skill, voice consistency, coherent extended prose.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| DeepSeek V4 Flash | 1/6 | $0.00142 | 99% |
| Qwen 3.6 Plus | 3/6 | $0.00721 | 84% |
| Kimi K2.6 | 3/6 | $0.01654 | 6% |
| DeepSeek V4 Pro | 3/6 | $0.02319 | 3% |
| Haiku 4.5 | 1/6 | $0.02396 | 80% |

See all 8 qualifying models →

Social & Promotional Content

Conciseness, platform-native conventions, engagement under tight character limits.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash | 2/3 | $0.00011 | 99% |
| Qwen 3.6 Plus | 2/3 | $0.00103 | |
| GPT-5.4 mini | 1/4 | $0.00188 | 84% |
| DeepSeek V4 Pro | 2/4 | $0.00396 | |
| Gemini 3.1 Pro Preview | 1/4 | $0.00502 | 56% |

See all 9 qualifying models →

Relevance, Classification & Matching

Semantic similarity judgment: does this thing belong in that bucket / match that target?

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| GPT-5.4 nano | 1/8 | $0.00005 | 59% |
| Qwen 3.5 Flash | 2/9 | $0.00013 | 99% |
| GPT-5.4 mini | 1/11 | $0.00017 | |
| Haiku 4.5 | 1/11 | $0.00021 | |
| Gemini 3.1 Flash Lite | 2/12 | $0.00044 | 72% |

See all 15 qualifying models →

Topic Organization & Clustering

Discover natural groupings without external schema, name and order them coherently.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash | 1/5 | $0.00009 | 99% |
| DeepSeek V4 Flash | 1/5 | $0.00034 | 95% |
| Gemini 3.1 Pro Preview | 1/5 | $0.00266 | 60% |
| Qwen 3.6 Plus | 2/5 | $0.00271 | 87% |
| Haiku 4.5 | 1/4 | $0.00679 | 81% |

See all 10 qualifying models →

Infrastructure & Utility

Mechanical competence at format conversion, metadata manipulation, prompt rewriting, translation; minimal domain expertise required.

| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash | 3/9 | $0.00040 | 98% |
| DeepSeek V4 Flash | 1/9 | $0.00079 | 97% |
| Gemini 3 Flash Preview | 3/9 | $0.00194 | |
| Qwen 3.6 Plus | 4/9 | $0.00458 | 80% |
| Gemini 3.1 Pro Preview | 2/9 | $0.01086 | 60% |

See all 10 qualifying models →

One model for your whole pipeline

For builders who want operational simplicity: ranked by fraction of tasks where the model is good enough, tie-broken by total pipeline cost.

| # | Model | Provider | Tasks good enough | Cheapest qualifier on | Strengths |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 | Moonshot AI | 23 / 35 (66%) | 6 tasks | Social & Promotional Content |
| 2 | GPT-5.5 | OpenAI | 20 / 35 (57%) | 2 tasks | |
| 3 | DeepSeek V4 Pro | DeepSeek | 15 / 35 (43%) | 1 task | |
| 4 | Claude Opus 4.7 | Anthropic | 14 / 35 (40%) | 1 task | |
| 5 | Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 14 / 35 (40%) | 7 tasks | |
| 6 | Gemini 3.1 Pro Preview | Gemini | 12 / 35 (34%) | 1 task | |
| 7 | Claude Sonnet 4.6 | Anthropic | 12 / 35 (34%) | | |
| 8 | DeepSeek V4 Flash | DeepSeek | 10 / 35 (29%) | 5 tasks | |
| 9 | Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 9 / 35 (26%) | 9 tasks | |
| 10 | Gemini 3 Flash Preview | Gemini | 5 / 35 (14%) | 1 task | |
| 11 | Haiku 4.5 | Anthropic | 3 / 35 (9%) | | |
| 12 | MiniMax M2.5 | MiniMax | 3 / 35 (9%) | | |
| 13 | GPT-5.4 nano | OpenAI | 3 / 35 (9%) | 2 tasks | |
| 14 | Gemini 3.1 Flash Lite | Gemini | 2 / 35 (6%) | | |
| 15 | GPT-5.4 mini | OpenAI | 2 / 35 (6%) | | |
| 16 | Gemini 3 Pro Image Preview | Gemini | 0 / 35 (0%) | | |
| 17 | Imagen 4.0 | Gemini | 0 / 35 (0%) | | |
| 18 | GPT-image-1.5 | OpenAI | 0 / 35 (0%) | | |
| 19 | Gemini 3.1 Flash Image Preview | Gemini | 0 / 35 (0%) | | |
| 20 | Imagen 4.0 Ultra | Gemini | 0 / 35 (0%) | | |
| 21 | Imagen 4.0 Fast | Gemini | 0 / 35 (0%) | | |
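The ranking rule for this table (fraction of tasks good enough, then total pipeline cost as the tie-breaker) can be sketched as a two-part sort key. The `ModelSummary` type, the model names, and the cost figures below are hypothetical, and the sketch assumes the lower pipeline cost wins a tie, which the page does not state.

```python
from dataclasses import dataclass

@dataclass
class ModelSummary:
    name: str
    tasks_good_enough: int
    tasks_total: int
    pipeline_cost: float  # hypothetical: total USD to run every task on this model

def rank_single_model(summaries):
    """Sort by coverage (descending), then by pipeline cost (ascending, assumed)."""
    return sorted(
        summaries,
        key=lambda s: (-s.tasks_good_enough / s.tasks_total, s.pipeline_cost),
    )

models = [
    ModelSummary("model-a", 14, 35, 4.20),
    ModelSummary("model-b", 23, 35, 1.10),
    ModelSummary("model-c", 14, 35, 0.90),
]
ranking = [m.name for m in rank_single_model(models)]
print(ranking)
```

Here `model-a` and `model-c` tie on coverage, so the cheaper `model-c` is placed ahead under the assumed tie-break direction.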

Default quality bar: 95% of the best-performing model on each task (chosen by knee detection on the savings curve — see methodology). Adjust the slider above to re-rank for a different tolerance.
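One common way to find a knee on a curve like this is to pick the point farthest from the straight line joining the curve's endpoints. The sketch below uses that heuristic as an illustration only; the page does not say which knee-detection method is used, and the function name, the tolerance grid, and the savings values are made up for the example.

```python
def find_knee(xs, ys):
    """Return the x whose point lies farthest from the chord joining the endpoints.

    Both axes are rescaled to [0, 1] first, so the chord becomes the diagonal
    and the distance to it is proportional to |y_norm - x_norm|.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    x_lo, x_hi = xs[0], xs[-1]
    y_lo, y_hi = ys[0], ys[-1]
    if x_lo == x_hi or y_lo == y_hi:
        raise ValueError("curve endpoints must differ on both axes")

    best_x, best_gap = xs[0], 0.0
    for x, y in zip(xs, ys):
        x_norm = (x - x_lo) / (x_hi - x_lo)
        y_norm = (y - y_lo) / (y_hi - y_lo)
        gap = abs(y_norm - x_norm)
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x

# Hypothetical savings curve: tolerance below the best model (in percent)
# against the average fraction of cost saved at that tolerance.
tolerances = [0, 1, 2, 3, 5, 7, 10]
savings = [0.00, 0.20, 0.40, 0.60, 0.90, 0.92, 0.93]

knee = find_knee(tolerances, savings)
print(f"knee at {knee}% tolerance, i.e. a quality bar of {100 - knee}%")
```

On this made-up curve, savings climb steeply up to a 5% tolerance and flatten afterwards, so the knee lands at a 95% bar. Real curves may be noisier and need smoothing before a knee can be read off reliably.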

Methodology, briefly

Quality scores come from LLM-judge verdicts on production workloads, on a 0–10 scale. Only MEDIUM-or-better cells qualify. The quality bar is relative to the best-performing model on each task — adjust the slider above to see your own pipeline.
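As a rough sketch of how these rules might combine for a single task: average the judge verdicts per model, drop cells below MEDIUM, then keep models within the bar of the best remaining score. Several details here are assumptions rather than the published method: that MEDIUM is a confidence tier with LOW below it and HIGH above it, that the panel score is a plain mean, and that the best-performing model is chosen among eligible cells only. The `qualifying_models` function and the sample verdicts are hypothetical.

```python
from statistics import mean

# Only MEDIUM is named on this page; LOW and HIGH are assumed tier names.
CONFIDENCE_ORDER = ["LOW", "MEDIUM", "HIGH"]

def qualifying_models(task_cells, bar=0.95, min_confidence="MEDIUM"):
    """task_cells maps model -> (judge scores on a 0-10 scale, confidence tier).

    Returns model -> panel score for every model that meets the relative bar.
    """
    floor = CONFIDENCE_ORDER.index(min_confidence)
    eligible = {
        model: mean(scores)  # assumed aggregation: simple mean over the panel
        for model, (scores, confidence) in task_cells.items()
        if CONFIDENCE_ORDER.index(confidence) >= floor
    }
    if not eligible:
        return {}
    best = max(eligible.values())
    return {model: score for model, score in eligible.items() if score >= bar * best}

# Hypothetical verdicts from a three-judge panel on one task.
task_cells = {
    "model-a": ([9, 9, 10], "HIGH"),
    "model-b": ([9, 9, 9], "MEDIUM"),
    "model-c": ([10, 10, 10], "LOW"),   # excluded: below MEDIUM
    "model-d": ([7, 8, 7], "HIGH"),     # excluded: under 95% of the best score
}
qualified = qualifying_models(task_cells)
print(sorted(qualified))
```

Lowering `bar` (the slider's role on the page) admits more models; with `bar=0.75`, `model-d` would also qualify in this example.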

Full methodology →

What changed this week

No changes recorded for this snapshot.

Full changelog →