Cheapest LLM that's good enough.

We run more than 150,000 large-language-model API calls per week and have a panel of leading models evaluate the outputs. Cost per token doesn't track quality: when you ask for "the most cost-effective model that is reliably 95% as good as the best one", the best-value model typically saves more than 90% of the cost.

This benchmark publishes our results so other builders can answer the same question for their own pipelines: which model(s) are good enough and the most cost-efficient for each category of tasks. The data also shows that the most expensive model — even when your budget can pay for it — is not always the best performer.

Have a look. Tell us what you think. We can also run tests for you — reach us at llmbench@kapualabs.com.

Snapshot: 2026-07-20

Right-sized per step

Best-value model per capability, at your quality bar. Greyed cells (shown as —) mean the model does not yet qualify on any task in that category.

Model	Provider	Financial Analysis & Trading Decisions	Structured Data & Fact Extraction	Content Summarization & Synthesis	Long-form Content Generation	Social & Promotional Content	Relevance, Classification & Matching	Topic Organization & Clustering	Infrastructure & Utility	Qualifies on
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	3.1x	52x	16x	15x	22x	20x	4.7x	15x	24/52
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	2.7x	56x	7.3x	7.2x	12x	17x	2.6x	11x	21/52
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	3.1x ★	31x	10x	9.4x	24x	33x ★	4.3x	15x	17/52
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	1.0x ★	1.1x	—	—	7.4x	9.2x	2.5x	3.8x	14/52
Claude Sonnet 5	Anthropic	7.7x	16x	39x	13x	18x	24x	8.7x	17x ★	24/52
Claude Sonnet 4.6	Anthropic	12x	16x	27x	25x	14x	28x	9.8x	35x	19/52
Claude Opus 4.8	Anthropic	12x	26x	34x	—	22x	69x	—	35x	12/52
Claude Haiku 4.5	Anthropic	7.9x	—	8.1x	15x	—	11x	3.1x	—	9/52
DeepSeek V4 Pro	DeepSeek	1.1x	11x	5.4x	4.5x ★	4.3x	3.6x ★	2.1x	5.8x	22/52
DeepSeek V4 Flash	DeepSeek	—	1.0x ★	1.0x ★	1.8x	1.0x ★	1.1x ★	1.3x ★	1.4x ★	18/52
Gemini 3.5 Flash	Gemini	5.6x	45x	7.6x ★	14x	11x	14x ★	10x	17x	38/52
Gemini 3.1 Pro Preview	Gemini	—	83x	15x	19x	13x	18x	5.5x	14x	16/52
Gemini 3.1 Flash Lite	Gemini	—	1.2x	—	—	—	1.4x ★	—	1.0x ★	6/52
Gemini 3 Pro Image Preview	Gemini	—	—	—	—	—	—	—	1.6x	1/52
Gemini 3.1 Flash Image Preview	Gemini	—	—	—	—	—	—	—	1.0x ★	1/52
Meta Muse Spark 1.1	Meta	7.6x	—	19x	20x	24x	32x ★	13x ★	28x	27/52
MiniMax M3	MiniMax	1.3x ★	7.6x	1.6x ★	1.9x ★	1.7x ★	3x ★	1.2x	2.4x ★	25/52
Kimi K2.6	Moonshot AI	7.4x	56x	22x	26x	25x	51x	10x	26x	27/52
GPT-5.5	OpenAI	22x	183x	51x	71x	14x	29x	16x	62x	30/52
GPT-5.6 Sol	OpenAI	14x	84x	31x	21x	13x	19x	20x	20x	28/52
GPT-5.6 Terra	OpenAI	4.5x	4.5x	8.9x	7.5x	4.8x	9.3x ★	4.5x	8.3x	26/52
GPT-5.6 Luna	OpenAI	2.5x	13x ★	5.6x	3.7x	2.5x	5.1x	1.8x	3.8x ★	25/52
GPT-5.4 Nano	OpenAI	1.3x ★	19x	1.8x ★	4.6x	—	1.6x	—	2.3x	9/52
GPT-5.4 Mini	OpenAI	—	—	—	—	1.9x	2.5x	1.0x ★	7.2x	6/52
GPT-image-2	OpenAI	—	—	—	—	—	—	—	1.5x	1/52
Tencent Hy3	OpenRouter	—	—	2.7x	2.2x ★	2.7x ★	3.9x	1.6x ★	2.4x ★	16/52
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	1.9x	4.4x	6.4x	7.8x	18x	24x	3.3x	8.1x	15/52
NVIDIA Nemotron-3 Super 120B	OpenRouter	—	2x	—	1.0x ★	3.1x	3.5x	—	—	6/52
NVIDIA Nemotron-3 Nano 30B-A3B	OpenRouter	—	—	—	1.0x ★	1.1x	1.0x ★	—	—	3/52
Grok 4.5	xAI	6.6x	73x	22x	19x	29x	33x	7.4x	19x	28/52

Each cell shows how much you overpay vs the best-value model clearing the bar in that category, on a scale from best value to most expensive (1.0x = this model is the best-value good-enough option, no overpayment; 3.2x = you pay ~3× more on average for the same quality). ★ marks any model that is the best-value pick on at least one of the category's tasks (multiple stars per column are possible; they're the per-task cost winners). Greyed cells (—): the model doesn't qualify on any task in that category at the current bar. Rows are grouped by provider; click a model name for the full per-task breakdown.
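One way to read the cell arithmetic, as a minimal sketch. It assumes a cell is the mean of per-task ratios (the model's cost divided by the cheapest qualifying cost on that task); that is an interpretation of the legend, not the published method. Model names, task names, and prices are hypothetical.

```python
from statistics import mean

# Hypothetical cost (USD per 1k runs) of each model on the tasks where it
# clears the quality bar. A missing task means the model does not qualify there.
costs = {
    "model-a": {"task-1": 6.00, "task-2": 9.00},
    "model-b": {"task-1": 2.00, "task-2": 1.50},
    "model-c": {"task-2": 0.50},
    "model-d": {},  # qualifies on nothing: a greyed cell
}


def category_cells(costs):
    """Return {model: (avg_overpay, has_star)} for one category, or None if greyed."""
    tasks = {t for per_task in costs.values() for t in per_task}
    # Cheapest qualifying cost per task is the best-value baseline.
    floor = {t: min(c[t] for c in costs.values() if t in c) for t in tasks}
    cells = {}
    for model, per_task in costs.items():
        if not per_task:
            cells[model] = None
            continue
        ratios = [cost / floor[t] for t, cost in per_task.items()]
        # A star means the model is the per-task cost winner at least once.
        cells[model] = (mean(ratios), any(r == 1.0 for r in ratios))
    return cells


def savings(overpay):
    """Fraction of spend saved by moving from an `overpay`x model to the baseline."""
    return 1.0 - 1.0 / overpay


cells = category_cells(costs)
for model, cell in cells.items():
    print(model, cell)
```

Under this reading two models can both carry a star in the same column, and a 10x cell corresponds to roughly 90% savings, since savings = 1 − 1/overpay.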

By capability

Financial Analysis & Trading Decisions

Read business/financial docs deeply enough to make forward-looking judgments about opportunity and risk.

Model	Tasks	Avg $/1k runs	Overpay
MiniMax M3 ★	3/4	$3.02	1.3x
Qwen 3.5 Flash ★	1/2	$3.11	best value
GPT-5.4 Nano ★	2/2	$7.01	1.3x
GPT-5.6 Luna	1/3	$7.68	2.5x
Qwen 3.7 Plus	3/3	$9.63	2.7x

See all 20 qualifying models →

Structured Data & Fact Extraction

Precise pattern recognition and field-level information retrieval, strict schema adherence.

Model	Tasks	Avg $/1k runs	Overpay
DeepSeek V4 Flash ★	3/3	$0.58	best value
Qwen 3.5 Flash	1/2	$0.75	1.1x
Gemini 3.1 Flash Lite	1/3	$0.83	1.2x
NVIDIA Nemotron-3 Super 120B	1/1	$1.39	2x
MiniMax M3	2/3	$3.36	7.6x

See all 22 qualifying models →

Content Summarization & Synthesis

Compress long input into essential nuggets without losing material detail.

Model	Tasks	Avg $/1k runs	Overpay
DeepSeek V4 Flash ★	1/2	$0.71	best value
Tencent Hy3	2/2	$2.54	2.7x
MiniMax M3 ★	2/3	$3.71	1.6x
NVIDIA Nemotron-3 Ultra 550B	1/2	$4.53	6.4x
GPT-5.4 Nano ★	2/3	$8.00	1.8x

See all 22 qualifying models →

Long-form Content Generation

Sustained compositional skill, voice consistency, coherent extended prose.

Model	Tasks	Avg $/1k runs	Overpay
NVIDIA Nemotron-3 Nano 30B-A3B ★	1/3	$0.28	best value
NVIDIA Nemotron-3 Super 120B ★	2/4	$1.08	best value
Tencent Hy3 ★	3/5	$1.44	2.2x
DeepSeek V4 Flash	1/2	$1.53	1.8x
MiniMax M3 ★	4/4	$1.59	1.9x

See all 23 qualifying models →

Social & Promotional Content

Conciseness, platform-native conventions, engagement under tight character limits.

Model	Tasks	Avg $/1k runs	Overpay
NVIDIA Nemotron-3 Nano 30B-A3B	1/2	$0.30	1.1x
DeepSeek V4 Flash ★	2/4	$0.42	best value
GPT-5.4 Mini	1/5	$0.52	1.9x
NVIDIA Nemotron-3 Super 120B	1/3	$0.85	3.1x
Tencent Hy3 ★	2/4	$1.30	2.7x

See all 24 qualifying models →

Relevance, Classification & Matching

Semantic similarity judgment: does this thing belong in that bucket / match that target?

Model	Tasks	Avg $/1k runs	Overpay
NVIDIA Nemotron-3 Nano 30B-A3B ★	1/2	$0.04	best value
GPT-5.4 Nano	2/9	$0.14	1.6x
Gemini 3.1 Flash Lite ★	4/10	$0.20	1.4x
GPT-5.4 Mini	3/9	$0.31	2.5x
NVIDIA Nemotron-3 Super 120B	2/4	$0.34	3.5x

See all 27 qualifying models →

Topic Organization & Clustering

Discover natural groupings without external schema, name and order them coherently.

Model	Tasks	Avg $/1k runs	Overpay
Tencent Hy3 ★	3/3	$2.05	1.6x
DeepSeek V4 Flash ★	2/3	$2.23	1.3x
GPT-5.4 Mini ★	1/3	$2.58	best value
MiniMax M3	1/2	$3.08	1.2x
Qwen 3.5 Flash	2/3	$3.21	2.5x

See all 22 qualifying models →

Infrastructure & Utility

Mechanical competence at format conversion, metadata manipulation, prompt rewriting, translation; minimal domain expertise required.

Model	Tasks	Avg $/1k runs	Overpay
Gemini 3.1 Flash Lite ★	1/7	$0.56	best value
Tencent Hy3 ★	3/7	$1.12	2.4x
DeepSeek V4 Flash ★	4/5	$1.44	1.4x
GPT-5.6 Luna ★	5/7	$2.23	3.8x
Qwen 3.5 Flash	4/9	$2.24	3.8x

See all 27 qualifying models →

Methodology, briefly

Quality scores come from LLM-judge verdicts on production workloads, on a 0–10 scale. Only MEDIUM-or-better cells qualify. The quality bar is relative to the best-performing model on each task — adjust the slider above to see your own pipeline.
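The relative quality bar can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark's actual code: the `Result` type, the function names, and all numbers are hypothetical, and the MEDIUM-or-better filter is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    quality: float      # judge score on the 0-10 scale
    cost_per_1k: float  # USD per 1,000 runs


def qualifying(results, bar=0.90):
    """Results whose quality is at least `bar` times the best quality on the task."""
    best = max(r.quality for r in results)
    return [r for r in results if r.quality >= bar * best]


def best_value(results, bar=0.90):
    """Cheapest result among those that clear the bar."""
    return min(qualifying(results, bar), key=lambda r: r.cost_per_1k)


# Hypothetical numbers for a single task, invented for illustration only.
task = [
    Result("model-a", quality=9.0, cost_per_1k=40.00),
    Result("model-b", quality=8.4, cost_per_1k=2.00),
    Result("model-c", quality=7.0, cost_per_1k=0.50),
]

print(best_value(task, bar=0.90).model)
print(best_value(task, bar=0.95).model)
```

Raising the bar shrinks the qualifying set, so the best-value pick can only stay the same or get more expensive.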

Full methodology →

What changed this week

2026-07-10 → 2026-07-20

Tasks: 52 · Models: 30 (+8) · Evaluations: 905 (+243)

New models: GPT-5.6 Luna, GPT-5.6 Sol, GPT-5.6 Terra, Meta Muse Spark 1.1, NVIDIA Nemotron-3 Nano 30B-A3B, NVIDIA Nemotron-3 Super 120B, NVIDIA Nemotron-3 Ultra 550B, Tencent Hy3

Per-model movement (every model, 90% bar). Best Value = number of tasks where the model is the best-value good-enough option; Qualifying = number of tasks where it is good enough. Changes since the previous snapshot are shown in parentheses:

Model	Best Value	Qualifying
DeepSeek V4 Flash	14 (-4)	18
MiniMax M3	11 (+5)	25 (+1)
Tencent Hy3 (new)	4 (new)	16 (new)
Gemini 3.5 Flash	2 (-5)	38 (+2)
Meta Muse Spark 1.1 (new)	2 (new)	27 (new)
GPT-5.6 Luna (new)	2 (new)	25 (new)
DeepSeek V4 Pro	2 (-2)	22 (-1)
Qwen 3.6 Flash	2 (-3)	17 (+1)
GPT-5.4 Nano	2 (+1)	9 (+3)
Gemini 3.1 Flash Lite	2 (+1)	6 (+1)
NVIDIA Nemotron-3 Super 120B (new)	2 (new)	6 (new)
NVIDIA Nemotron-3 Nano 30B-A3B (new)	2 (new)	3 (new)
GPT-5.6 Terra (new)	1 (new)	26 (new)
Claude Sonnet 5	1 (+1)	24 (+2)
Qwen 3.5 Flash	1 (-5)	14
GPT-5.4 Mini	1 (+1)	6 (+1)
Gemini 3.1 Flash Image Preview	1	1
GPT-5.5	0	30 (+1)
GPT-5.6 Sol (new)	0 (new)	28 (new)
Grok 4.5	0	28 (+2)
Kimi K2.6	0	27 (+3)
Qwen 3.6 Plus	0	24 (+1)
Qwen 3.7 Plus	0 (-2)	21 (+1)
Claude Sonnet 4.6	0	19 (+2)
Gemini 3.1 Pro Preview	0	16 (+1)
NVIDIA Nemotron-3 Ultra 550B (new)	0 (new)	15 (new)
Claude Opus 4.8	0	12 (+4)
Claude Haiku 4.5	0 (-1)	9
Gemini 3 Pro Image Preview	0	1
GPT-image-2	0	1
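The parenthesised deltas in the table above could be produced along these lines. This is a sketch only: the `movement` function is hypothetical, and the previous-snapshot counts are back-computed from the deltas shown (for example, 14 with a change of -4 implies 18 before).

```python
def movement(prev, curr):
    """Return (model, count, label) rows, highest count first.

    `prev` and `curr` map model name to a task count for two snapshots.
    The label is "new" for a model absent from `prev`, a signed delta
    when the count changed, and an empty string when it did not.
    """
    rows = []
    for model, count in curr.items():
        if model not in prev:
            label = "new"
        else:
            delta = count - prev[model]
            label = f"{delta:+d}" if delta else ""
        rows.append((model, count, label))
    # sorted() is stable, so ties keep their input order.
    return sorted(rows, key=lambda row: -row[1])


# Best Value counts for four models from the table above.
prev_best_value = {"DeepSeek V4 Flash": 18, "MiniMax M3": 6, "Gemini 3.5 Flash": 7}
curr_best_value = {
    "DeepSeek V4 Flash": 14,
    "MiniMax M3": 11,
    "Tencent Hy3": 4,
    "Gemini 3.5 Flash": 2,
}

for model, count, label in movement(prev_best_value, curr_best_value):
    print(f"{model}\t{count} ({label})" if label else f"{model}\t{count}")
```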

Default quality bar 95% → 90%

Full changelog →