Cost mode

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 4909→599 tokens

Models

Frontier on this task: GPT-5.5 at 5.69 / 10. Quality bar at 95% of the frontier score: 5.41.
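The bar figure is consistent with taking 95% of the frontier score (0.95 × 5.69 = 5.4055, which rounds to 5.41). Here is a minimal Python sketch of that arithmetic, assuming the bar is a plain fraction of the frontier score rounded to two decimals. The function name `quality_bar` is hypothetical and does not come from the page.

```python
def quality_bar(frontier_score: float, fraction: float = 0.95) -> float:
    """Return the quality bar as a fraction of the frontier score.

    Assumption: bar = fraction * frontier_score, rounded to 2 decimals.
    """
    return round(frontier_score * fraction, 2)


bar = quality_bar(5.69)
print(bar)          # 5.41
print(5.48 >= bar)  # GPT-5.5's CI low (5.48) sits above the bar: True
```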

Chart: blended cost per call and savings vs GPT-5.5 (quality axis 0 to 10, with the 95% bar marked)

GPT-5.5 · $0.021258/call · 0% cheaper
Qwen 3.6 Plus · $0.002763/call · 87% cheaper
Haiku 4.5 · $0.003952/call · 81% cheaper
Claude Opus 4.7 · $0.019760/call · 7% cheaper
Claude Sonnet 4.6 · $0.011856/call · 44% cheaper
DeepSeek V4 Flash · $0.000855/call · 96% cheaper
DeepSeek V4 Pro · $0.010626/call · 50% cheaper
Gemini 3 Flash Preview · $0.002126/call · 90% cheaper
Gemini 3.1 Flash Lite · $0.001063/call · 95% cheaper
Gemini 3.1 Pro Preview · $0.008503/call · 60% cheaper
MiniMax M2.5 · $0.002192/call · 90% cheaper
Kimi K2.6 · $0.004236/call · 80% cheaper
GPT-5.4 mini · $0.003189/call · 85% cheaper
GPT-5.4 nano · $0.000865/call · 96% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
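The "% cheaper" figures in the chart are consistent with one minus the ratio of a model's cost to the GPT-5.5 anchor cost, rounded to a whole percent. The sketch below reproduces a few of them under that assumption. The formula is inferred from the listed numbers rather than stated on the page, and `savings_pct` is a hypothetical name.

```python
ANCHOR_COST = 0.021258  # GPT-5.5 blended cost per call, taken from the chart


def savings_pct(cost: float, anchor: float = ANCHOR_COST) -> int:
    """Whole-percent saving relative to the anchor (assumed formula)."""
    return round(100 * (1 - cost / anchor))


# A few per-call costs copied from the chart above.
costs = {
    "Qwen 3.6 Plus": 0.002763,
    "Claude Opus 4.7": 0.019760,
    "DeepSeek V4 Flash": 0.000855,
}

# Print cheapest first, mirroring the "sorted by blended cost" note.
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: {savings_pct(cost)}% cheaper")
```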

Cost breakdown

Model | Quality | Sample | Blended cost / call | Savings vs best | Mode
GPT-5.5 (best, OpenAI) | 5.69 / 10, CI [5.48, 5.90] | n=62 · high | $0.021258 | (anchor) | batch

Typical call shape for this task: 4909 input tokens → 599 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
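The blended-cost formula above can be sketched in Python as follows. The page does not list per-token prices, so the two prices used here are purely hypothetical placeholders, and the result is not expected to match any figure in the table. The sketch also assumes prices are expressed per token; `blended_cost` is an illustrative name.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Blended cost per call: in * in_price + out * out_price, to 6 decimals.

    Assumption: prices are in dollars per single token.
    """
    return round(in_tokens * in_price + out_tokens * out_price, 6)


# Typical call shape from the page: 4909 input tokens, 599 output tokens.
# HYPOTHETICAL prices: $2.00 per million input tokens, $10.00 per million output.
HYPOTHETICAL_IN_PRICE = 2.00 / 1_000_000
HYPOTHETICAL_OUT_PRICE = 10.00 / 1_000_000

cost = blended_cost(4909, 599, HYPOTHETICAL_IN_PRICE, HYPOTHETICAL_OUT_PRICE)
print(f"${cost:.6f}/call")  # $0.015808/call
```

Substituting a provider's actual input and output prices for the placeholders gives the per-call figure for that model at this call shape.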