Cheapest LLM that's good enough.
We run more than 150,000 large-language-model API calls per week and have a panel of leading models evaluate the outputs. Cost per token doesn't track quality: when you ask for *"the model that's reliably 95% as good as the best one but most cost-effective"*, the cheapest qualifier typically saves more than 90% of the cost.
This benchmark publishes our results so other builders can answer the same question for their own pipelines: which models are good enough, and which of those is the most cost-efficient, for each category of tasks. The data also shows that the most expensive model, even when your budget can pay for it, is not always the best performer.
Have a look. Tell us what you think. We can also run tests for you — reach us at llmbench@kapualabs.com.
Snapshot: 2026-05-21
Right-sized per step
Cheapest qualifying model per capability, at your quality bar. Greyed cells mean no model qualifies on that category yet.
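As a rough illustration of the "cheapest qualifying model" rule, here is a minimal Python sketch. It assumes a model qualifies when its quality score is at least 95% of the best score on the task, and that savings are measured against the cost of the best-scoring model. The savings baseline, the `Result` type, the function name, and the numbers are all hypothetical and are not taken from the benchmark's own code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    quality: float  # mean judge score on a 0-10 scale
    cost: float     # average cost per call in USD


def cheapest_qualifier(results, bar=0.95):
    """Return (winner, savings) for a single task.

    A model qualifies if its quality is at least `bar` times the best quality.
    Savings are relative to the best-scoring model's cost (an assumption).
    """
    if not results:
        return None, 0.0
    best = max(results, key=lambda r: r.quality)
    qualifiers = [r for r in results if r.quality >= bar * best.quality]
    winner = min(qualifiers, key=lambda r: r.cost)
    savings = 1.0 - winner.cost / best.cost if best.cost > 0 else 0.0
    return winner, savings


# Made-up numbers for illustration only, not benchmark data.
demo = [
    Result("model-a", quality=9.0, cost=0.100),
    Result("model-b", quality=8.7, cost=0.010),
    Result("model-c", quality=7.0, cost=0.001),
]
winner, savings = cheapest_qualifier(demo)
print(winner.model, f"{savings:.0%}")
```

In this toy example the cheapest model of all (`model-c`) is excluded because it falls below the bar, so the rule picks `model-b`.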
By capability
Financial Analysis & Trading Decisions
Read business/financial docs deeply enough to make forward-looking judgments about opportunity and risk.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.6 Plus ★ | 1/5 | $0.01472 | 77% |
| Kimi K2.6 ★ | 2/5 | $0.05434 | 58% |
| Claude Sonnet 4.6 | 2/5 | $0.10179 | — |
| GPT-5.5 ★ | 3/5 | $0.15379 | — |
| Claude Opus 4.7 | 2/5 | $0.16965 | — |
Structured Data & Fact Extraction
Precise pattern recognition and field-level information retrieval, strict schema adherence.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash ★ | 1/4 | $0.00037 | 88% |
| DeepSeek V4 Flash ★ | 2/5 | $0.00062 | 78% |
| GPT-5.4 nano | 1/5 | $0.00194 | 37% |
| MiniMax M2.5 | 1/5 | $0.00215 | 30% |
| Qwen 3.6 Plus | 1/4 | $0.00307 | — |
Content Summarization & Synthesis
Compress long input into essential nuggets without losing material detail.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| GPT-5.4 nano ★ | 1/4 | $0.00205 | — |
| Kimi K2.6 ★ | 2/4 | $0.01223 | — |
| Gemini 3.1 Pro Preview | 1/4 | $0.01386 | — |
Long-form Content Generation
Sustained compositional skill, voice consistency, coherent extended prose.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| DeepSeek V4 Flash ★ | 1/6 | $0.00142 | 99% |
| Qwen 3.6 Plus ★ | 3/6 | $0.00721 | 84% |
| Kimi K2.6 | 3/6 | $0.01654 | 6% |
| DeepSeek V4 Pro | 3/6 | $0.02319 | 3% |
| Haiku 4.5 | 1/6 | $0.02396 | 80% |
Social & Promotional Content
Conciseness, platform-native conventions, engagement under tight character limits.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash ★ | 2/3 | $0.00011 | 99% |
| Qwen 3.6 Plus | 2/3 | $0.00103 | — |
| GPT-5.4 mini | 1/4 | $0.00188 | 84% |
| DeepSeek V4 Pro | 2/4 | $0.00396 | — |
| Gemini 3.1 Pro Preview | 1/4 | $0.00502 | 56% |
Relevance, Classification & Matching
Semantic similarity judgment: does this thing belong in that bucket / match that target?
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| GPT-5.4 nano ★ | 1/8 | $0.00005 | 59% |
| Qwen 3.5 Flash ★ | 2/9 | $0.00013 | 99% |
| GPT-5.4 mini | 1/11 | $0.00017 | — |
| Haiku 4.5 | 1/11 | $0.00021 | — |
| Gemini 3.1 Flash Lite | 2/12 | $0.00044 | 72% |
Topic Organization & Clustering
Discover natural groupings without external schema, name and order them coherently.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash ★ | 1/5 | $0.00009 | 99% |
| DeepSeek V4 Flash | 1/5 | $0.00034 | 95% |
| Gemini 3.1 Pro Preview | 1/5 | $0.00266 | 60% |
| Qwen 3.6 Plus ★ | 2/5 | $0.00271 | 87% |
| Haiku 4.5 | 1/4 | $0.00679 | 81% |
Infrastructure & Utility
Mechanical competence at format conversion, metadata manipulation, prompt rewriting, translation; minimal domain expertise required.
| Model | Tasks | Avg cost | Saves |
|---|---|---|---|
| Qwen 3.5 Flash ★ | 3/9 | $0.00040 | 98% |
| DeepSeek V4 Flash | 1/9 | $0.00079 | 97% |
| Gemini 3 Flash Preview ★ | 3/9 | $0.00194 | — |
| Qwen 3.6 Plus ★ | 4/9 | $0.00458 | 80% |
| Gemini 3.1 Pro Preview | 2/9 | $0.01086 | 60% |
One model for your whole pipeline
For builders who want operational simplicity: ranked by fraction of tasks where the model is good enough, tie-broken by total pipeline cost.
| # | Model | Provider | Tasks good enough | Cheapest qualifier on | Strengths |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 | Moonshot AI | 23 / 35 (66%) | 6 tasks | Social & Promotional Content |
| 2 | GPT-5.5 | OpenAI | 20 / 35 (57%) | 2 tasks | — |
| 3 | DeepSeek V4 Pro | DeepSeek | 15 / 35 (43%) | 1 task | — |
| 4 | Claude Opus 4.7 | Anthropic | 14 / 35 (40%) | 1 task | — |
| 5 | Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 14 / 35 (40%) | 7 tasks | — |
| 6 | Gemini 3.1 Pro Preview | Gemini | 12 / 35 (34%) | 1 task | — |
| 7 | Claude Sonnet 4.6 | Anthropic | 12 / 35 (34%) | — | — |
| 8 | DeepSeek V4 Flash | DeepSeek | 10 / 35 (29%) | 5 tasks | — |
| 9 | Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 9 / 35 (26%) | 9 tasks | — |
| 10 | Gemini 3 Flash Preview | Gemini | 5 / 35 (14%) | 1 task | — |
| 11 | Haiku 4.5 | Anthropic | 3 / 35 (9%) | — | — |
| 12 | MiniMax M2.5 | MiniMax | 3 / 35 (9%) | — | — |
| 13 | GPT-5.4 nano | OpenAI | 3 / 35 (9%) | 2 tasks | — |
| 14 | Gemini 3.1 Flash Lite | Gemini | 2 / 35 (6%) | — | — |
| 15 | GPT-5.4 mini | OpenAI | 2 / 35 (6%) | — | — |
| 16 | Gemini 3 Pro Image Preview | Gemini | 0 / 35 (0%) | — | — |
| 17 | Imagen 4.0 | Gemini | 0 / 35 (0%) | — | — |
| 18 | GPT-image-1.5 | OpenAI | 0 / 35 (0%) | — | — |
| 19 | Gemini 3.1 Flash Image Preview | Gemini | 0 / 35 (0%) | — | — |
| 20 | Imagen 4.0 Ultra | Gemini | 0 / 35 (0%) | — | — |
| 21 | Imagen 4.0 Fast | Gemini | 0 / 35 (0%) | — | — |
Default quality bar: 95% of the best-performing model on each task (chosen by knee detection on the savings curve — see methodology). Adjust the slider above to re-rank for a different tolerance.
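The ranking rule for the single-model table (fraction of tasks where the model is good enough, ties broken by total pipeline cost) could be sketched roughly like this in Python. The sketch assumes that the cheaper model wins a tie, which the page does not state explicitly. The function name, inputs, and numbers are hypothetical.

```python
def rank_models(qualifying_tasks, pipeline_cost, n_tasks):
    """Rank models by coverage, then by total pipeline cost.

    qualifying_tasks: dict of model -> set of task ids where it meets the bar
    pipeline_cost:    dict of model -> total cost of running all tasks on it
    n_tasks:          total number of tasks in the pipeline
    """
    rows = []
    for model, tasks in qualifying_tasks.items():
        coverage = len(tasks) / n_tasks
        rows.append((model, coverage, pipeline_cost[model]))
    # Higher coverage first; among equal coverage, lower total cost first
    # (the tie-break direction is an assumption).
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


# Made-up inputs for illustration only, not benchmark data.
demo_tasks = {
    "model-a": {"t1", "t2", "t3"},
    "model-b": {"t1", "t2", "t3"},
    "model-c": {"t1"},
}
demo_cost = {"model-a": 1.20, "model-b": 0.15, "model-c": 0.02}

ranking = rank_models(demo_tasks, demo_cost, n_tasks=4)
for position, (model, coverage, cost) in enumerate(ranking, start=1):
    print(position, model, f"{coverage:.0%}", f"${cost:.2f}")
```

Note that coverage dominates cost here: `model-c` is by far the cheapest but still ranks last because it qualifies on only one of four tasks.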
Methodology, briefly
Quality scores come from LLM-judge verdicts on production workloads, on a 0–10 scale. Only MEDIUM-or-better cells qualify. The quality bar is relative to the best-performing model on each task — adjust the slider above to see your own pipeline.
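The page does not say which knee-detection method is used to choose the default 95% bar. One common heuristic, shown here purely as an assumption, is to take the point on the savings curve that lies farthest from the straight line joining its endpoints. The function name and the curve values below are hypothetical.

```python
def knee_index(xs, ys):
    """Index of the point farthest from the chord joining the endpoints."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    # Normalize both axes to [0, 1] so the result does not depend on units.
    x_span = (xs[-1] - xs[0]) or 1.0
    y_lo, y_hi = min(ys), max(ys)
    y_span = (y_hi - y_lo) or 1.0
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - y_lo) / y_span for y in ys]
    # Perpendicular distance of each point from the first-to-last chord.
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    dist = [
        abs(dy * (x - nx[0]) - dx * (y - ny[0])) / norm
        for x, y in zip(nx, ny)
    ]
    return max(range(len(dist)), key=dist.__getitem__)


# Made-up savings curve: quality bar (fraction of best) vs. cost savings.
bars = [0.80, 0.85, 0.90, 0.95, 0.97, 0.99, 1.00]
savings_curve = [0.97, 0.96, 0.95, 0.93, 0.70, 0.30, 0.00]

i = knee_index(bars, savings_curve)
print(bars[i])
```

On this toy curve the savings stay nearly flat up to a bar of 0.95 and collapse beyond it, so the heuristic picks 0.95 as the knee.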
What changed this week
No changes recorded for this snapshot.