Models — The Right-Sized LLM Benchmark

One page per LLM in the benchmark — its performance across every task, where it’s the cheapest qualifier at the default quality bar, and which categories it does and doesn’t qualify on.
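
The two counts in each entry below ("Good enough on N/52 tasks" and "Best value on N tasks") can be read as a simple selection rule: a model is good enough on a task if its score meets the bar, and it is best value if it is the cheapest model that does. The sketch below illustrates that reading only. The `Result` fields, the 0-to-1 score scale, the cost unit, the tie-breaking rule, and the model and task names are all assumptions made for the example, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass

QUALITY_BAR = 0.90  # the "90% bar" used on this page; the 0-1 scale is an assumption


@dataclass(frozen=True)
class Result:
    model: str
    task: str
    score: float  # assumed: quality expressed as a fraction between 0 and 1
    cost: float   # assumed: cost per run, in any consistent unit


def summarize(results, bar=QUALITY_BAR):
    """Return {model: (tasks_good_enough, tasks_best_value)}."""
    good = {}      # model -> number of tasks where it meets the bar
    best = {}      # model -> number of tasks where it is the cheapest qualifier
    cheapest = {}  # task -> cheapest qualifying Result seen so far
    for r in results:
        good.setdefault(r.model, 0)
        best.setdefault(r.model, 0)
        if r.score >= bar:
            good[r.model] += 1
            current = cheapest.get(r.task)
            # Assumed tie-break: on equal cost, the first result seen wins.
            if current is None or r.cost < current.cost:
                cheapest[r.task] = r
    for r in cheapest.values():
        best[r.model] += 1
    return {m: (good[m], best[m]) for m in good}


# Hypothetical data, for illustration only.
results = [
    Result("model-a", "task-1", 0.95, 3.0),
    Result("model-b", "task-1", 0.92, 0.5),
    Result("model-c", "task-1", 0.80, 0.1),
    Result("model-a", "task-2", 0.91, 3.0),
    Result("model-b", "task-2", 0.85, 0.5),
    Result("model-c", "task-2", 0.70, 0.1),
]
summary = summarize(results)
for model, (n_good, n_best) in summary.items():
    print(f"{model}: good enough on {n_good} tasks, best value on {n_best}")
```

Under this reading a model can clear the bar on many tasks and still be best value on none, because a cheaper model also clears the bar on each of them. That pattern appears in several entries below.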

Claude Haiku 4.5
Good enough on 9/52 tasks at the 90% bar. Best value on 0 tasks. Doesn't qualify on any: Structured Data & Fact Extraction, Social & Promotional Content, Infrastructure & Utility.
Claude Opus 4.8
Good enough on 12/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions, Structured Data & Fact Extraction, Social & Promotional Content, Relevance, Clas
Claude Sonnet 4.6
Good enough on 19/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Long-form Content Generation.
Claude Sonnet 5
Good enough on 24/52 tasks at the 90% bar. Best value on 1 task. Best fit for: Financial Analysis & Trading Decisions, Long-form Content Generation, Social & Promotional Content, Infrastructure & Util
DeepSeek V4 Flash
Good enough on 18/52 tasks at the 90% bar. Best value on 14 tasks. Best fit for: Structured Data & Fact Extraction, Infrastructure & Utility. Doesn't qualify on any: Financial Analysis & Trading Decis
DeepSeek V4 Pro
Good enough on 22/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Structured Data & Fact Extraction, Long-form Content Generation.
Gemini 3 Pro Image Preview
Good enough on 1/52 tasks at the 90% bar. Best value on 0 tasks.
Gemini 3.1 Flash Image Preview
Good enough on 1/52 tasks at the 90% bar. Best value on 1 task.
Gemini 3.1 Flash Lite
Good enough on 6/52 tasks at the 90% bar. Best value on 2 tasks. Doesn't qualify on any: Financial Analysis & Trading Decisions, Content Summarization & Synthesis, Long-form Content Generation, Social
Gemini 3.1 Pro Preview
Good enough on 16/52 tasks at the 90% bar. Best value on 0 tasks. Doesn't qualify on any: Financial Analysis & Trading Decisions.
Gemini 3.5 Flash
Good enough on 38/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Financial Analysis & Trading Decisions, Structured Data & Fact Extraction, Content Summarization & Synthesis, Long-form
GPT-5.4 Mini
Good enough on 6/52 tasks at the 90% bar. Best value on 1 task. Doesn't qualify on any: Financial Analysis & Trading Decisions, Structured Data & Fact Extraction, Content Summarization & Synthesis, Lo
GPT-5.4 Nano
Good enough on 9/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Financial Analysis & Trading Decisions. Doesn't qualify on any: Social & Promotional Content, Topic Organization & Cluste
GPT-5.5
Good enough on 30/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions, Structured Data & Fact Extraction, Long-form Content Generation.
GPT-5.6 Luna
Good enough on 25/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Structured Data & Fact Extraction, Content Summarization & Synthesis, Long-form Content Generation.
GPT-5.6 Sol
Good enough on 28/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Structured Data & Fact Extraction, Long-form Content Generation, Social & Promotional Content, Relevance, Classification
GPT-5.6 Terra
Good enough on 26/52 tasks at the 90% bar. Best value on 1 task. Best fit for: Long-form Content Generation, Social & Promotional Content.
GPT-image-2
Good enough on 1/52 tasks at the 90% bar. Best value on 0 tasks.
Grok 4.5
Good enough on 28/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions, Social & Promotional Content, Relevance, Classification & Matching.
Kimi K2.6
Good enough on 27/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions, Structured Data & Fact Extraction, Content Summarization & Synthesis, Social & P
Meta Muse Spark 1.1
Good enough on 27/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Social & Promotional Content, Relevance, Classification & Matching, Topic Organization & Clustering. Doesn't qualify on
MiniMax M3
Good enough on 25/52 tasks at the 90% bar. Best value on 11 tasks. Best fit for: Financial Analysis & Trading Decisions, Long-form Content Generation, Social & Promotional Content.
NVIDIA Nemotron-3 Nano 30B-A3B
Good enough on 3/52 tasks at the 90% bar. Best value on 2 tasks. Doesn't qualify on any: Financial Analysis & Trading Decisions, Infrastructure & Utility.
NVIDIA Nemotron-3 Super 120B
Good enough on 6/52 tasks at the 90% bar. Best value on 2 tasks. Doesn't qualify on any: Financial Analysis & Trading Decisions, Topic Organization & Clustering, Infrastructure & Utility.
NVIDIA Nemotron-3 Ultra 550B
Good enough on 15/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Long-form Content Generation, Relevance, Classification & Matching, Topic Organization & Clustering.
Qwen 3.5 Flash
Good enough on 14/52 tasks at the 90% bar. Best value on 1 task. Doesn't qualify on any: Content Summarization & Synthesis, Long-form Content Generation.
Qwen 3.6 Flash
Good enough on 17/52 tasks at the 90% bar. Best value on 2 tasks. Best fit for: Financial Analysis & Trading Decisions.
Qwen 3.6 Plus
Good enough on 24/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions.
Qwen 3.7 Plus
Good enough on 21/52 tasks at the 90% bar. Best value on 0 tasks. Best fit for: Financial Analysis & Trading Decisions, Long-form Content Generation, Social & Promotional Content, Relevance, Classific
Tencent Hy3
Good enough on 16/52 tasks at the 90% bar. Best value on 4 tasks. Best fit for: Content Summarization & Synthesis, Topic Organization & Clustering. Doesn't qualify on any: Financial Analysis & Trading