Cost mode:

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 2378→1707 tokens

Models

Frontier on this task: Claude Opus 4.7 at 8.57 / 10. Quality bar at 95%: 8.15.

[Chart: one bar per model on a quality axis from 0 to 10, with the 95% bar marked. Bar labels give blended cost per call and savings versus Claude Opus 4.7.]

Qwen 3.5 Flash: $0.000515/call, 98% cheaper
DeepSeek V4 Flash: $0.000811/call, 97% cheaper
Qwen 3.6 Plus: $0.004102/call, 85% cheaper
Kimi K2.6: $0.005452/call, 80% cheaper
DeepSeek V4 Pro: $0.010078/call, 63% cheaper
Claude Opus 4.7: $0.027282/call, 0% cheaper
Haiku 4.5: $0.005456/call, 80% cheaper
Claude Sonnet 4.6: $0.016370/call, 40% cheaper
Gemini 3 Flash Preview: $0.003155/call, 88% cheaper
Gemini 3.1 Flash Lite: $0.001578/call, 94% cheaper
Gemini 3.1 Pro Preview: $0.012620/call, 54% cheaper
MiniMax M2.5: $0.002762/call, 90% cheaper
GPT-5.4 mini: $0.004732/call, 83% cheaper
GPT-5.4 nano: $0.001305/call, 95% cheaper
GPT-5.5: $0.031550/call, -16% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Cost breakdown

Model                   Provider                   Quality    CI            Sample          Blended cost / call  Savings vs best  Mode
Qwen 3.5 Flash          Alibaba Cloud (DashScope)  8.30 / 10  [8.17, 8.44]  n=100 · ranked  $0.000515            98% cheaper      sync
DeepSeek V4 Flash       DeepSeek                   8.35 / 10  [8.21, 8.50]  n=100 · ranked  $0.000811            97% cheaper      sync
Qwen 3.6 Plus           Alibaba Cloud (DashScope)  8.34 / 10  [8.23, 8.44]  n=100 · ranked  $0.004102            85% cheaper      sync
Kimi K2.6               Moonshot AI                8.27 / 10  [8.13, 8.41]  n=100 · ranked  $0.005452            80% cheaper      batch
DeepSeek V4 Pro         DeepSeek                   8.43 / 10  [8.29, 8.58]  n=100 · ranked  $0.010078            63% cheaper      sync
Claude Opus 4.7 (best)  Anthropic                  8.57 / 10  [8.42, 8.73]  n=100 · ranked  $0.027282            (anchor)         batch
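
The greyed-row rule from the chart legend can be sketched against the rows above. This is a minimal illustration, not the page's actual logic: the `Row` type, the `bar_status` helper, the status labels, and the use of `>=` for "clears the bar" are all assumptions. The bar value 8.15 and the scores are copied from this page.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.15  # "Quality bar at 95%" as stated on this page


@dataclass(frozen=True)
class Row:
    """One table row (hypothetical type, for illustration only)."""
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower end of the confidence interval
    ci_high: float  # upper end of the confidence interval


# Values copied from the cost breakdown table.
ROWS = [
    Row("Qwen 3.5 Flash", 8.30, 8.17, 8.44),
    Row("DeepSeek V4 Flash", 8.35, 8.21, 8.50),
    Row("Qwen 3.6 Plus", 8.34, 8.23, 8.44),
    Row("Kimi K2.6", 8.27, 8.13, 8.41),
    Row("DeepSeek V4 Pro", 8.43, 8.29, 8.58),
    Row("Claude Opus 4.7", 8.57, 8.42, 8.73),
]


def bar_status(row: Row, bar: float = QUALITY_BAR) -> str:
    """Classify a row against the quality bar (assumed: >= counts as clearing)."""
    if row.ci_low >= bar:
        return "clears on CI low"
    if row.quality >= bar:
        return "clears on point estimate only"
    return "below bar"


statuses = {row.model: bar_status(row) for row in ROWS}
for model, status in statuses.items():
    print(f"{model}: {status}")
```

Under these assumptions, only Kimi K2.6 falls into the "point estimate only" group, because its CI low of 8.13 sits below the 8.15 bar while its point estimate of 8.27 sits above it.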

Typical call shape for this task: 2378 input tokens → 1707 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
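
As a rough sketch, the blended-cost formula and the "savings vs best" column might be computed as below. The function names are hypothetical, the per-million-token price unit is an assumption, and the prices in the first example are placeholders, since this page does not list per-token prices. The savings formula, one minus the ratio to the anchor cost, is inferred from the published figures rather than stated on the page.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Blended cost per call in dollars, rounded to 6 decimals.

    Prices are assumed to be quoted in dollars per million tokens.
    """
    cost = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return round(cost, 6)


def savings_pct(cost: float, anchor_cost: float) -> int:
    """Whole-percent savings relative to the anchor (inferred formula)."""
    return round((1 - cost / anchor_cost) * 100)


# Typical call shape from this page; the $1.00 / $2.00 prices are placeholders.
example_cost = blended_cost(2378, 1707, in_price_per_m=1.0, out_price_per_m=2.0)
print(example_cost)

# Savings recomputed from blended costs published on this page.
ANCHOR_COST = 0.027282  # Claude Opus 4.7
print(savings_pct(0.000515, ANCHOR_COST))  # Qwen 3.5 Flash
print(savings_pct(0.031550, ANCHOR_COST))  # GPT-5.5
```

With the published costs, this reproduces the 98% figure for Qwen 3.5 Flash and the -16% figure for GPT-5.5, where the negative value means the model costs more per call than the anchor.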