Cost mode:

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 15277→4830 tokens

Models

Frontier on this task: DeepSeek V4 Flash at 8.24 / 10. Quality bar at 95%: 7.82.

Chart (axis 0 to 10, with the 95% bar marked). Per-model labels:

| Model | Cost / call | Cheaper than best |
| --- | --- | --- |
| DeepSeek V4 Flash | $0.003491 | 0% |
| DeepSeek V4 Pro | $0.043390 | -1143% |
| Gemini 3.1 Pro Preview | $0.044257 | -1168% |
| GPT-5.5 | $0.110642 | -3069% |
| Qwen 3.6 Plus | $0.014384 | -312% |
| Haiku 4.5 | $0.019714 | -465% |
| Claude Opus 4.7 | $0.098568 | -2723% |
| Claude Sonnet 4.6 | $0.059140 | -1594% |
| Gemini 3 Flash Preview | $0.011064 | -217% |
| Gemini 3.1 Flash Lite | $0.005532 | -58% |
| MiniMax M2.5 | $0.010379 | -197% |
| Kimi K2.6 | $0.020300 | -481% |
| GPT-5.4 mini | $0.016596 | -375% |
| GPT-5.4 nano | $0.004546 | -30% |

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
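The "% cheaper" labels in the chart are consistent with a simple relative saving against the cheapest qualifier (the anchor), where a negative value means the model costs more than the anchor. The sketch below is an inference from the listed numbers, not a documented formula; the function name `pct_cheaper` is hypothetical.

```python
# Assumption: "% cheaper" = (1 - cost / anchor_cost) * 100, rounded to a whole percent.
# This reproduces the chart labels but is not confirmed by the page itself.

ANCHOR_COST = 0.003491  # DeepSeek V4 Flash, $/call, as listed in the chart


def pct_cheaper(cost: float, anchor_cost: float = ANCHOR_COST) -> int:
    """Percent saved relative to the anchor; negative means more expensive."""
    return round((1 - cost / anchor_cost) * 100)


# A few per-call costs copied from the chart.
costs = {
    "DeepSeek V4 Pro": 0.043390,
    "GPT-5.5": 0.110642,
    "GPT-5.4 nano": 0.004546,
}

for model, cost in costs.items():
    print(f"{model}: {pct_cheaper(cost)}% cheaper")
# DeepSeek V4 Pro: -1143% cheaper
# GPT-5.5: -3069% cheaper
# GPT-5.4 nano: -30% cheaper
```

Read this way, "-1143% cheaper" means DeepSeek V4 Pro costs roughly 12.4 times as much per call as the anchor.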

Cost breakdown

| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
| --- | --- | --- | --- | --- | --- |
| DeepSeek V4 Flash (best) · DeepSeek | 8.24 / 10, CI [8.09, 8.38] | n=100 · ranked | $0.003491 | (anchor) | sync |
| DeepSeek V4 Pro · DeepSeek | 8.14 / 10, CI [7.97, 8.31] | n=100 · ranked | $0.043390 | | sync |
| Gemini 3.1 Pro Preview · Gemini | 7.88 / 10, CI [7.73, 8.04] | n=100 · ranked | $0.044257 | | batch |
| GPT-5.5 · OpenAI | 8.08 / 10, CI [7.90, 8.26] | n=100 · high | $0.110642 | | batch |
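The rows above can be checked against the 95% quality bar of 7.82 in two ways: by point estimate and by the lower end of the confidence interval. The sketch below is a minimal illustration of that check and of the cheapest-first ordering. The `Row` type and `classify` function are hypothetical names, and the "MEDIUM+" condition mentioned in the chart caption is not modelled because the page does not define it.

```python
from dataclasses import dataclass

QUALITY_BAR = 7.82  # "Quality bar at 95%" as stated on this page


@dataclass
class Row:
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower end of the confidence interval
    cost: float     # blended cost per call, USD


def classify(row: Row, bar: float = QUALITY_BAR) -> str:
    """Label a row by how firmly it clears the quality bar."""
    if row.ci_low >= bar:
        return "clears bar"
    if row.quality >= bar:
        return "point estimate only"
    return "below bar"


# Values copied from the cost breakdown table.
rows = [
    Row("DeepSeek V4 Flash", 8.24, 8.09, 0.003491),
    Row("DeepSeek V4 Pro", 8.14, 7.97, 0.043390),
    Row("Gemini 3.1 Pro Preview", 7.88, 7.73, 0.044257),
    Row("GPT-5.5", 8.08, 7.90, 0.110642),
]

# Keep models whose point estimate clears the bar, cheapest first.
qualifiers = sorted((r for r in rows if r.quality >= QUALITY_BAR), key=lambda r: r.cost)

for r in qualifiers:
    print(f"{r.model}: ${r.cost:.6f}/call, {classify(r)}")
```

With these four rows, only Gemini 3.1 Pro Preview falls into the "point estimate only" group, since its CI low of 7.73 sits below the bar.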

Typical call shape for this task: 15277 input tokens → 4830 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
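As a worked example of the blended-cost formula, the sketch below applies it to the typical call shape. The page does not list per-token prices, so the $0.14 and $0.28 per million tokens used here are assumed values chosen for illustration, and the per-million-token unit is likewise an assumption. With those inputs the result happens to match the $0.003491 listed for DeepSeek V4 Flash.

```python
def blended_cost(
    tokens_in: int,
    tokens_out: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Blended cost per call in USD: in * in_price + out * out_price, rounded to 6 decimals.

    Prices are assumed to be quoted in USD per million tokens.
    """
    usd = (tokens_in * in_price_per_mtok + tokens_out * out_price_per_mtok) / 1_000_000
    return round(usd, 6)


# Typical call shape from this page; prices are hypothetical, not taken from the page.
print(blended_cost(15277, 4830, in_price_per_mtok=0.14, out_price_per_mtok=0.28))
# 0.003491
```

Under these assumed prices, the input side contributes about $0.002139 and the output side about $0.001352, so input tokens account for roughly 61% of the cost per call.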