
Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 2304→1306 tokens

Models

Frontier on this task: DeepSeek V4 Pro at 9.31 / 10. Quality bar at 95%: 8.85.

Chart: quality score per model (axis 0 to 10, with the 95% bar marked), labelled with blended cost per call and savings relative to DeepSeek V4 Pro.

DeepSeek V4 Flash: $0.000688/call, 92% cheaper
Kimi K2.6: $0.007413/call, 13% cheaper
DeepSeek V4 Pro: $0.008554/call, 0% cheaper
Gemini 3.1 Pro Preview: $0.020280/call, -137% cheaper
GPT-5.5: $0.050700/call, -493% cheaper
Qwen 3.5 Flash: $0.000409/call, 95% cheaper
Qwen 3.6 Plus: $0.003296/call, 61% cheaper
Gemini 3 Flash Preview: $0.005070/call, 41% cheaper
Gemini 3.1 Flash Lite: $0.002535/call, 70% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
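The greying rule in the legend can be sketched as a small classifier. This is a minimal illustration, not the dashboard's actual logic: the `Estimate` type, the `status` function, and the three status labels are hypothetical names, the bar of 8.85 is taken from the "Quality bar at 95%" figure on this page, and the sample-confidence condition ("MEDIUM+") is left out for brevity.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.85  # "Quality bar at 95%" as stated on this page


@dataclass(frozen=True)
class Estimate:
    """Quality point estimate with its confidence interval (hypothetical type)."""
    model: str
    score: float
    ci_low: float
    ci_high: float


def status(e: Estimate, bar: float = QUALITY_BAR) -> str:
    """Classify a model against the quality bar.

    Labels are illustrative assumptions:
      "qualifies": even the CI low clears the bar
      "greyed":    the point estimate clears the bar, the CI low does not
      "below bar": the point estimate itself is under the bar
    """
    if e.ci_low >= bar:
        return "qualifies"
    if e.score >= bar:
        return "greyed"
    return "below bar"


pro = Estimate("DeepSeek V4 Pro", 9.31, 9.22, 9.41)
flash = Estimate("DeepSeek V4 Flash", 8.87, 8.63, 9.11)
print(status(pro), status(flash))
```

Using the CI low rather than the point estimate as the qualifying value is the more conservative choice: a model only counts as a qualifier if the evidence supports it even at the pessimistic end of its interval.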

Cost breakdown

| Model | Provider | Quality | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | 8.87 / 10 | [8.63, 9.11] | n=76 · high | $0.000688 | 92% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.88 / 10 | [8.64, 9.12] | n=89 · high | $0.007413 | 13% cheaper | batch |
| DeepSeek V4 Pro (best) | DeepSeek | 9.31 / 10 | [9.22, 9.41] | n=71 · ranked | $0.008554 | (anchor) | sync |
| Gemini 3.1 Pro Preview | Gemini | 8.87 / 10 | [8.67, 9.07] | n=67 · ranked | $0.020280 | | batch |
| GPT-5.5 | OpenAI | 8.95 / 10 | [8.66, 9.24] | n=64 · high | $0.050700 | | batch |
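The "Savings vs best" figures appear to be the percentage difference from the anchor model's blended cost. The sketch below assumes that definition, savings = (1 − cost / anchor cost) × 100 rounded to a whole percent; the page does not state the formula, and `savings_vs_anchor` is a hypothetical helper name. A negative result means the model costs more than the anchor.

```python
ANCHOR_COST = 0.008554  # DeepSeek V4 Pro blended cost per call, from this page


def savings_vs_anchor(cost: float, anchor_cost: float = ANCHOR_COST) -> int:
    """Percent cheaper than the anchor; negative means more expensive.

    Assumed formula (not stated on the page): (1 - cost / anchor) * 100.
    """
    return round((1 - cost / anchor_cost) * 100)


costs = {
    "DeepSeek V4 Flash": 0.000688,
    "Kimi K2.6": 0.007413,
    "Gemini 3.1 Pro Preview": 0.020280,
    "GPT-5.5": 0.050700,
}
for model, cost in costs.items():
    print(f"{model}: {savings_vs_anchor(cost)}% cheaper")
```

Under this assumption the computed values agree with the percentages shown on this page, including the negative ones for the two models that cost more than the anchor.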

Typical call shape for this task: 2304 input tokens → 1306 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
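As a worked instance of the blended-cost formula, the sketch below applies it to the typical call shape. The function name `blended_cost` and the prices are assumptions for illustration only: the page does not list per-token prices, so the example uses hypothetical rates of $2.00 per million input tokens and $12.00 per million output tokens, and it assumes prices are quoted per million tokens.

```python
def blended_cost(
    tokens_in: int,
    tokens_out: int,
    in_price_per_m: float,
    out_price_per_m: float,
) -> float:
    """Blended cost per call in dollars, rounded to 6 decimals.

    Prices are assumed to be quoted in dollars per million tokens.
    """
    cost = (tokens_in * in_price_per_m + tokens_out * out_price_per_m) / 1_000_000
    return round(cost, 6)


# Typical call shape from this page: 2304 input tokens, 1306 output tokens.
# The prices below are hypothetical, chosen only to illustrate the arithmetic.
print(f"${blended_cost(2304, 1306, 2.00, 12.00):.6f}")  # $0.020280
```

Because output tokens are usually priced higher than input tokens, the 1306 output tokens dominate the total here even though the input is almost twice as long.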