Cost mode:

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 42219→501 tokens

Models

Frontier on this task: Qwen 3.5 Flash at 8.41 / 10. Quality bar at 95%: 7.99.
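The two figures above are consistent with the quality bar being 95% of the frontier score. That relationship is inferred from the numbers, not stated as a formula on the page. A minimal Python check, where `quality_bar` is a hypothetical helper name:

```python
def quality_bar(frontier_score: float, fraction: float = 0.95) -> float:
    """Quality bar as a fraction of the frontier score, rounded to 2 decimals.

    Assumption: the bar is simply fraction * frontier_score.
    """
    return round(frontier_score * fraction, 2)


# Frontier on this task: Qwen 3.5 Flash at 8.41 / 10
bar = quality_bar(8.41)
print(bar)  # 7.99
```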

[Bar chart: one bar per model on a 0 to 10 axis, with the 95% bar marked. Bar labels, in the order shown:]

Qwen 3.5 Flash: $0.001397/call, 0% cheaper
GPT-5.4 nano: $0.004535/call, -225% cheaper
DeepSeek V4 Flash: $0.006051/call, -333% cheaper
Gemini 3 Flash Preview: $0.011306/call, -709% cheaper
MiniMax M2.5: $0.013267/call, -850% cheaper
Claude Sonnet 4.6: $0.067086/call, -4702% cheaper
Claude Opus 4.7: $0.111810/call, -7904% cheaper
GPT-5.5: $0.113062/call, -7993% cheaper
Qwen 3.6 Plus: $0.014698/call, -952% cheaper
Haiku 4.5: $0.022362/call, -1501% cheaper
DeepSeek V4 Pro: $0.075205/call, -5283% cheaper
Gemini 3.1 Flash Lite: $0.005653/call, -305% cheaper
Gemini 3.1 Pro Preview: $0.045225/call, -3137% cheaper
Kimi K2.6: $0.025267/call, -1709% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain).

Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
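The "% cheaper" labels match a relative saving against the cheapest qualifier (Qwen 3.5 Flash at $0.001397/call), so a negative value means the model costs more than the anchor. The formula below is inferred from the displayed figures, not documented on the page, and `savings_vs_best` is a hypothetical name:

```python
def savings_vs_best(cost: float, best_cost: float) -> int:
    """Percent saved relative to the anchor model, rounded to a whole percent.

    Assumption: savings = (best_cost - cost) / best_cost * 100.
    Negative results mean the model is more expensive than the anchor.
    """
    return round((best_cost - cost) / best_cost * 100)


ANCHOR = 0.001397  # Qwen 3.5 Flash, blended cost per call

print(savings_vs_best(0.004535, ANCHOR))  # GPT-5.4 nano -> -225
print(savings_vs_best(0.113062, ANCHOR))  # GPT-5.5      -> -7993
```

Read this way, "-225% cheaper" means GPT-5.4 nano costs roughly 3.25 times as much per call as the anchor.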

Cost breakdown

| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.5 Flash (best) | Alibaba Cloud (DashScope) | 8.41 / 10, CI [8.23, 8.60] | n=100 · ranked | $0.001397 | (anchor) | sync |
| GPT-5.4 nano | OpenAI | 8.09 / 10, CI [7.79, 8.38] | n=100 · high | $0.004535 | | batch |
| DeepSeek V4 Flash | DeepSeek | 8.35 / 10, CI [8.16, 8.54] | n=100 · high | $0.006051 | | sync |
| Gemini 3 Flash Preview | Gemini | 8.18 / 10, CI [7.75, 8.60] | n=93 · medium | $0.011306 | | batch |
| MiniMax M2.5 | MiniMax | 8.08 / 10, CI [7.76, 8.40] | n=100 · high | $0.013267 | | sync |
| Claude Sonnet 4.6 | Anthropic | 8.36 / 10, CI [8.12, 8.60] | n=100 · high | $0.067086 | | batch |
| Claude Opus 4.7 | Anthropic | 8.08 / 10, CI [7.78, 8.39] | n=59 · medium | $0.111810 | | batch |
| GPT-5.5 | OpenAI | 8.29 / 10, CI [8.02, 8.56] | n=100 · high | $0.113062 | | batch |

Typical call shape for this task: 42219 input tokens → 501 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
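The blended-cost formula can be sketched as follows. The page does not list per-token prices or their unit, so this sketch assumes prices are quoted in USD per million tokens and uses made-up prices for illustration. `blended_cost` and its parameter names are hypothetical, and any batch or sync pricing difference is not modeled.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Blended cost per call = in * in_price + out * out_price, rounded to 6 decimals.

    Assumption: prices are USD per 1,000,000 tokens.
    """
    cost = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return round(cost, 6)


# Typical call shape from the page: 42219 input tokens, 501 output tokens.
# The prices below are illustrative placeholders, not any provider's real rates.
example = blended_cost(42219, 501, in_price_per_m=0.03, out_price_per_m=0.30)
print(example)
```

With a call shape this input-heavy (about 84 input tokens per output token), the input price dominates the blended cost unless the output price is far higher than the input price.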