Cost mode:

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 18712→1218 tokens

Models

Frontier on this task: GPT-5.5 at 5.92 / 10. Quality bar at 95%: 5.62.
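The quality bar appears to be 95% of the frontier score. A minimal sketch of that arithmetic, assuming the bar is the frontier score times 0.95, rounded to two decimals (the function names here are illustrative, not part of the dashboard):

```python
def quality_bar(frontier_score: float, fraction: float = 0.95) -> float:
    """Return the quality bar as a fraction of the frontier score.

    Assumption: the bar is rounded to two decimals, as displayed on the page.
    """
    return round(frontier_score * fraction, 2)


def clears_bar(score: float, bar: float) -> bool:
    """True if a quality score meets or exceeds the bar."""
    return score >= bar


bar = quality_bar(5.92)           # frontier score quoted for GPT-5.5
print(bar)
print(clears_bar(5.88, bar))      # a point estimate of 5.88
print(clears_bar(5.49, bar))      # a CI low of 5.49
```

Checking both the point estimate and the CI low against the same bar separates models that clear it with confidence from those that clear it only on the point estimate.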

Chart: quality by model (axis 0 to 10, 95% bar marked), with blended cost per call and savings vs. GPT-5.5:

Qwen 3.6 Plus · $0.008456/call · 87% cheaper
GPT-5.5 · $0.065050/call · 0% cheaper
Qwen 3.5 Flash · $0.000878/call · 99% cheaper
Haiku 4.5 · $0.012401/call · 81% cheaper
Claude Sonnet 4.6 · $0.037203/call · 43% cheaper
DeepSeek V4 Flash · $0.002961/call · 95% cheaper
DeepSeek V4 Pro · $0.036798/call · 43% cheaper
Gemini 3 Flash Preview · $0.006505/call · 90% cheaper
Gemini 3.1 Flash Lite · $0.003252/call · 95% cheaper
Gemini 3.1 Pro Preview · $0.026020/call · 60% cheaper
MiniMax M2.5 · $0.007075/call · 89% cheaper
Kimi K2.6 · $0.013589/call · 79% cheaper
GPT-5.4 mini · $0.009758/call · 85% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Cost breakdown

| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.6 Plus, Alibaba Cloud (DashScope) | 5.88 / 10, CI [5.49, 6.26] | n=100 · ranked | $0.008456 | 87% cheaper | sync |
| GPT-5.5 (best), OpenAI | 5.92 / 10, CI [5.56, 6.28] | n=100 · ranked | $0.065050 | (anchor) | batch |

Typical call shape for this task: 18712 input tokens → 1218 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
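As a worked sketch of the blended-cost formula above, the snippet below prices the typical call shape and computes savings against an anchor. The per-million-token prices are hypothetical placeholders, not any provider's actual rates, and the savings helper assumes the page rounds to the nearest whole percent, which is consistent with the listed figures.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Blended cost per call: in * in_price + out * out_price.

    Assumption: prices are quoted in dollars per million tokens.
    The result is rounded to 6 decimals, as stated on the page.
    """
    cost = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1_000_000
    return round(cost, 6)


def savings_vs_anchor(cost: float, anchor_cost: float) -> int:
    """Percent cheaper than the anchor, rounded to a whole percent (assumed)."""
    return round((1 - cost / anchor_cost) * 100)


# Hypothetical prices, for illustration only.
HYPOTHETICAL_IN_PRICE = 1.00   # $ per million input tokens
HYPOTHETICAL_OUT_PRICE = 4.00  # $ per million output tokens

# Typical call shape from this page: 18712 input tokens, 1218 output tokens.
print(blended_cost(18712, 1218, HYPOTHETICAL_IN_PRICE, HYPOTHETICAL_OUT_PRICE))

# Savings for the two blended costs listed in the cost breakdown.
print(savings_vs_anchor(0.008456, 0.065050))
```

With the blended costs listed for Qwen 3.6 Plus and GPT-5.5, the savings helper gives 87, matching the "87% cheaper" figure shown.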