Cheapest LLM that's good enough.

We run more than 150,000 large-language-model API calls per week and have a panel of leading models evaluate the outputs. Cost per token doesn't track quality: when you ask for "the most cost-effective model that's reliably 95% as good as the best one", the best-value model typically saves more than 90% of the cost.

This benchmark publishes our results so other builders can answer the same question for their own pipelines: which models are good enough, and which of those is the most cost-efficient, for each category of tasks. The data also shows that the most expensive model is not always the best performer, even when your budget can pay for it.

Have a look. Tell us what you think. We can also run tests for you — reach us at llmbench@kapualabs.com.

Snapshot: 2026-07-10

Right-sized per step

Best-value model per capability, at your quality bar. Greyed cells mean the model doesn't qualify on any task in that category yet.

Model	Provider	Financial Analysis & Trading Decisions	Structured Data & Fact Extraction	Content Summarization & Synthesis	Long-form Content Generation	Social & Promotional Content	Relevance, Classification & Matching	Topic Organization & Clustering	Infrastructure & Utility	Qualifies on
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	1.6x ★	14x	—	3.5x	6.6x	5.2x ★	3.2x	2.4x ★	14/52
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	1.2x	23x	1.4x	1.3x	—	2.2x ★	1.0x ★	—	11/52
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	1.0x ★	—	1.0x ★	1.2x ★	—	2.3x ★	—	—	7/52
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	—	—	—	—	1.3x	1.1x ★	1.0x ★	—	6/52
Claude Sonnet 5	Anthropic	3.4x	—	3.9x	4x	21x	6.6x ★	—	7.5x	12/52
Claude Sonnet 4.6	Anthropic	2.2x	5x	—	5.7x	—	14x	3.4x	4.8x	9/52
Claude Haiku 4.5	Anthropic	—	1.0x ★	—	1.3x	—	2.9x	1.1x	—	4/52
Claude Opus 4.8	Anthropic	—	—	7.6x	—	—	23x	—	—	3/52
DeepSeek V4 Pro	DeepSeek	—	2x ★	—	1.0x ★	3.1x	3.3x	1.2x	1.0x ★	10/52
DeepSeek V4 Flash	DeepSeek	—	1.0x ★	—	—	1.0x ★	1.0x ★	—	1.0x ★	6/52
Gemini 3.5 Flash	Gemini	1.4x ★	6.6x	1.4x ★	2.7x	1.0x ★	2.8x ★	1.0x ★	3.2x ★	28/52
Gemini 3.1 Pro Preview	Gemini	—	14x	—	3.5x	1.0x ★	6.7x	—	2.5x ★	8/52
Gemini 3.1 Flash Lite	Gemini	—	—	—	—	—	1.1x	—	1.0x ★	4/52
Gemini 3 Pro Image Preview	Gemini	—	—	—	—	—	—	—	1.6x	1/52
MiniMax M3	MiniMax	1.0x ★	3.3x	1.0x ★	1.1x ★	4.3x	2x ★	—	1.0x ★	14/52
Kimi K2.6	Moonshot AI	2.4x	10x	2.9x	4.1x	5.4x	9.5x	2.4x	2.7x	16/52
GPT-5.5	OpenAI	13x	—	—	22x	30x	44x	12x	11x	15/52
GPT-5.4 Nano	OpenAI	—	2.4x	1.0x ★	—	—	1.3x	—	—	3/52
GPT-5.4 Mini	OpenAI	—	—	—	—	2.3x	2.5x	—	—	2/52
GPT-image-2	OpenAI	—	—	—	—	—	—	—	1.0x ★	1/52
Grok 4.5	xAI	4.4x	—	4.6x	5.3x	—	7.9x	3.8x	—	15/52

Each cell shows how much you overpay vs the best-value good-enough model in that category (1.0x = this model is the best-value good-enough option, no overpayment; 3.2x = you pay ~3× more on average for the same quality). ★ marks any model that is the best-value pick on at least one of the category's tasks; multiple stars per column are possible because they mark the per-task cost winners. Greyed cells: the model doesn't qualify on any task in that category at the current bar. Rows are grouped by provider; click a model name for the full per-task breakdown.
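To make the overpay figure concrete, here is a minimal sketch of how such a cell could be computed. It assumes the ratio is taken per task against the cheapest qualifying model and then averaged over the tasks where the model qualifies; the function name, data layout, and prices are hypothetical and are not taken from the benchmark.

```python
from statistics import mean

def overpay_ratio(model, task_costs):
    """Average cost ratio of `model` vs the cheapest qualifying model, per task.

    task_costs: {task: {model_name: cost}}, listing only the models that
    clear the quality bar on that task (assumed layout).
    Returns None if the model qualifies on no task (a greyed cell).
    """
    ratios = [
        costs[model] / min(costs.values())
        for costs in task_costs.values()
        if model in costs
    ]
    return mean(ratios) if ratios else None

# Hypothetical costs for two tasks in one category.
category = {
    "task_a": {"model_x": 0.50, "model_y": 1.50},
    "task_b": {"model_x": 2.00, "model_y": 1.00, "model_z": 4.00},
}

cell_x = overpay_ratio("model_x", category)  # mean of 1.0 and 2.0
cell_z = overpay_ratio("model_z", category)  # qualifies on task_b only
```

Under this assumption a model can be the cost winner on one task (and so carry a ★) while still showing more than 1.0x for the category, as `model_x` does here.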

By capability

Financial Analysis & Trading Decisions

Read business/financial docs deeply enough to make forward-looking judgments about opportunity and risk.

Model	Tasks	Avg $/1M input tokens	Overpay
Qwen 3.6 Flash ★	1/3	$0.58	best value
MiniMax M3 ★	2/4	$0.60	best value
Qwen 3.7 Plus	1/3	$0.72	1.2x
Qwen 3.6 Plus ★	2/4	$0.85	1.6x
Gemini 3.5 Flash ★	2/5	$1.04	1.4x

See all 10 qualifying models →

Structured Data & Fact Extraction

Precise pattern recognition and field-level information retrieval, strict schema adherence.

Model	Tasks	Avg $/1M input tokens	Overpay
DeepSeek V4 Flash ★	1/3	$0.47	best value
Claude Haiku 4.5 ★	1/2	$0.91	best value
DeepSeek V4 Pro ★	3/4	$0.95	2x
GPT-5.4 Nano	1/2	$1.13	2.4x
MiniMax M3	2/4	$1.56	3.3x

See all 11 qualifying models →

Content Summarization & Synthesis

Compress long input into essential nuggets without losing material detail.

Model	Tasks	Avg $/1M input tokens	Overpay
Qwen 3.6 Flash ★	1/2	$0.41	best value
GPT-5.4 Nano ★	1/3	$0.50	best value
Qwen 3.7 Plus	1/3	$0.57	1.4x
MiniMax M3 ★	1/3	$0.93	best value
Gemini 3.5 Flash ★	2/4	$1.37	1.4x

See all 9 qualifying models →

Long-form Content Generation

Sustained compositional skill, voice consistency, coherent extended prose.

Model	Tasks	Avg $/1M input tokens	Overpay
DeepSeek V4 Pro ★	2/3	$0.81	best value
Claude Haiku 4.5	1/3	$1.30	1.3x
MiniMax M3 ★	4/4	$1.44	1.1x
Qwen 3.6 Flash ★	2/5	$1.80	1.2x
Qwen 3.7 Plus	3/5	$2.28	1.3x

See all 13 qualifying models →

Social & Promotional Content

Conciseness, platform-native conventions, engagement under tight character limits.

Model	Tasks	Avg $/1M input tokens	Overpay
DeepSeek V4 Flash ★	1/4	$0.15	best value
Qwen 3.5 Flash	1/4	$0.20	1.3x
GPT-5.4 Mini	1/5	$0.36	2.3x
DeepSeek V4 Pro	1/5	$0.48	3.1x
MiniMax M3	1/4	$0.66	4.3x

See all 11 qualifying models →

Relevance, Classification & Matching

Semantic similarity judgment: does this thing belong in that bucket / match that target?

Model	Tasks	Avg $/1M input tokens	Overpay
DeepSeek V4 Flash ★	3/10	$0.17	best value
GPT-5.4 Nano	1/9	$0.20	1.3x
Qwen 3.5 Flash ★	4/9	$0.34	1.1x
Gemini 3.1 Flash Lite	3/10	$0.39	1.1x
GPT-5.4 Mini	1/9	$0.40	2.5x

See all 19 qualifying models →

Topic Organization & Clustering

Discover natural groupings without external schema, name and order them coherently.

Model	Tasks	Avg $/1M input tokens	Overpay
Qwen 3.5 Flash ★	1/3	$0.15	best value
Qwen 3.7 Plus ★	1/1	$0.45	best value
Claude Haiku 4.5	1/2	$0.47	1.1x
DeepSeek V4 Pro	1/3	$0.52	1.2x
Qwen 3.6 Plus	2/3	$0.71	3.2x

See all 10 qualifying models →

Infrastructure & Utility

Mechanical competence at format conversion, metadata manipulation, prompt rewriting, translation; minimal domain expertise required.

Model	Tasks	Avg $/1M input tokens	Overpay
Gemini 3.1 Flash Lite ★	1/7	$0.41	best value
DeepSeek V4 Flash ★	1/5	$0.44	best value
MiniMax M3 ★	2/6	$0.63	best value
DeepSeek V4 Pro ★	1/6	$1.64	best value
Gemini 3.5 Flash ★	7/9	$3.24	3.2x

See all 13 qualifying models →

Methodology, briefly

Quality scores come from LLM-judge verdicts on production workloads, on a 0–10 scale. Only MEDIUM-or-better cells qualify. The quality bar is relative to the best-performing model on each task; adjust the slider above to set the bar for your own pipeline.
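The relative quality bar can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: it assumes one judge score per model per task, treats the bar as a fraction of the top score, and leaves out the MEDIUM-or-better filter. The function name, scores, and costs are hypothetical.

```python
def best_value(scores, costs, bar=0.95):
    """Return (qualifying models, cheapest qualifying model) for one task.

    scores: {model: judge score on the 0-10 scale}
    costs:  {model: cost for the task, in any consistent unit}
    bar:    fraction of the top score a model must reach to qualify
    """
    threshold = bar * max(scores.values())
    qualifying = [m for m, s in scores.items() if s >= threshold]
    return qualifying, min(qualifying, key=costs.get)

# Hypothetical scores and costs for a single task.
scores = {"model_x": 9.0, "model_y": 8.7, "model_z": 7.5}
costs = {"model_x": 5.00, "model_y": 0.40, "model_z": 0.10}

qualifying, pick = best_value(scores, costs)
# Saving of the best-value pick relative to the top-scoring model.
saving = 1 - costs[pick] / costs["model_x"]
```

Lowering `bar` widens the qualifying set, so the cheapest pick can change; with `bar=0.8` the hypothetical `model_z` would qualify and win on cost.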

Full methodology →

What changed this week

Changes from 2026-07-07 to 2026-07-10

Tasks: 52 · Models: 22 (+1) · Evaluations: 662 (+41)

New models: Grok 4.5

Per-model movement (every model, 90% bar). Best value = tasks where the model is the best-value good-enough option; qualifying = tasks where it is good enough:

Model	Best Value	Qualifying
DeepSeek V4 Flash	18	18 (-1)
Gemini 3.5 Flash	7	36
MiniMax M3	6	24
Qwen 3.5 Flash	6	14 (-2)
Qwen 3.6 Flash	5 (+3)	16 (+2)
DeepSeek V4 Pro	4 (-2)	23
Qwen 3.7 Plus	2 (+1)	20 (+2)
Claude Haiku 4.5	1	9
GPT-5.4 Nano	1	6
Gemini 3.1 Flash Lite	1 (-2)	5 (-1)
Gemini 3.1 Flash Image Preview	1 (+1)	1 (-1)
GPT-5.5	0	29
Grok 4.5 (new)	0 (new)	26 (new)
Kimi K2.6	0	24 (+1)
Qwen 3.6 Plus	0	23
Claude Sonnet 5	0	22
Claude Sonnet 4.6	0	17 (-1)
Gemini 3.1 Pro Preview	0	15
Claude Opus 4.8	0	8 (+1)
GPT-5.4 Mini	0	5
Gemini 3 Pro Image Preview	0	1 (-1)
GPT-image-2	0 (-1)	1
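The two counts and their deltas could be tallied along these lines. This is a rough sketch that assumes each snapshot is available as a per-task record of the qualifying models and the best-value pick; the helper names and the sample data are hypothetical.

```python
from collections import Counter

def tally(picks):
    """Count best-value wins and qualifying tasks per model.

    picks: {task: (qualifying_models, best_value_model)} (assumed layout)
    """
    best, qual = Counter(), Counter()
    for qualifying, winner in picks.values():
        best[winner] += 1
        qual.update(qualifying)
    return best, qual

def with_delta(now, before):
    """Format a count in the style of the table above, e.g. '5 (+3)' or '18'."""
    diff = now - before
    return f"{now} ({diff:+d})" if diff else str(now)

# Hypothetical picks for two snapshots.
previous = {"t1": (["a", "b"], "a"), "t2": (["a"], "a")}
current = {"t1": (["a", "b"], "b"), "t2": (["a", "b"], "a")}

best_prev, qual_prev = tally(previous)
best_now, qual_now = tally(current)
row_a = (with_delta(best_now["a"], best_prev["a"]),
         with_delta(qual_now["a"], qual_prev["a"]))
row_b = (with_delta(best_now["b"], best_prev["b"]),
         with_delta(qual_now["b"], qual_prev["b"]))
```

A `Counter` returns 0 for a model missing from a snapshot, so a newly added model simply shows its full count as the delta in this sketch.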

Default quality bar 90% → 95%

Full changelog →