Best LLMs for topic_client_matching autogenerated — DTP Benchmark
Models
Frontier on this task: DeepSeek V4 Flash at 8.24 / 10. Quality bar (95% of the frontier score): 7.82.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
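The qualification rule in the legend can be sketched as a small check. This is an illustration, not the benchmark's own code. It assumes that "clears the bar" means greater than or equal to the bar, takes the published bar of 7.82 as-is rather than recomputing it, and ignores the MEDIUM+ condition, which the page does not define. The `Row` type and the function names are hypothetical; the scores, CI lows, and costs are copied from the cost breakdown table.

```python
from dataclasses import dataclass

QUALITY_BAR = 7.82  # published bar, used as-is (assumption: not recomputed)


@dataclass(frozen=True)
class Row:
    model: str
    score: float    # point estimate, out of 10
    ci_low: float   # lower bound of the confidence interval
    cost: float     # blended cost per call, USD


# Values copied from the cost breakdown table.
ROWS = [
    Row("DeepSeek V4 Flash", 8.24, 8.09, 0.003491),
    Row("DeepSeek V4 Pro", 8.14, 7.97, 0.043390),
    Row("Gemini 3.1 Pro Preview", 7.88, 7.73, 0.044257),
    Row("GPT-5.5", 8.08, 7.90, 0.110642),
]


def classify(row: Row, bar: float = QUALITY_BAR) -> str:
    """Label a row by comparing its CI low and point estimate to the bar."""
    if row.ci_low >= bar:
        return "qualifies"   # even the pessimistic estimate clears the bar
    if row.score >= bar:
        return "greyed"      # point estimate clears the bar, CI low does not
    return "below bar"


def cheapest_qualifiers(rows: list[Row], bar: float = QUALITY_BAR) -> list[Row]:
    """Return qualifying rows sorted by blended cost, cheapest first."""
    qualifying = [r for r in rows if classify(r, bar) == "qualifies"]
    return sorted(qualifying, key=lambda r: r.cost)


if __name__ == "__main__":
    for r in ROWS:
        print(f"{r.model}: {classify(r)}")
```

Under these assumptions, Gemini 3.1 Pro Preview is the only row whose point estimate (7.88) clears the bar while its CI low (7.73) does not; the other three rows qualify on CI low.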
Cost breakdown
| Model | Provider | Quality (out of 10) | 95% CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash (best) | DeepSeek | 8.24 | [8.09, 8.38] | n=100 · ranked | $0.003491 | (anchor) | sync |
| DeepSeek V4 Pro | DeepSeek | 8.14 | [7.97, 8.31] | n=100 · ranked | $0.043390 | — | sync |
| Gemini 3.1 Pro Preview | Gemini | 7.88 | [7.73, 8.04] | n=100 · ranked | $0.044257 | — | batch |
| GPT-5.5 | OpenAI | 8.08 | [7.90, 8.26] | n=100 · high | $0.110642 | — | batch |
Typical call shape for this task: 15277 input tokens → 4830 output tokens, EMA-tracked from production traffic. Blended cost = in × in_price + out × out_price, where in and out are the input and output token counts of the typical call; the result is rounded to 6 decimals.
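A minimal sketch of the blended-cost formula, using the typical call shape above. The page does not list per-token prices, so `in_price_per_token` and `out_price_per_token` below are made-up placeholders (0.10 and 0.40 USD per million tokens) chosen only to show the arithmetic. They are not the rates of any model in the table, and `blended_cost` is a hypothetical helper name.

```python
def blended_cost(
    in_tokens: int,
    out_tokens: int,
    in_price_per_token: float,
    out_price_per_token: float,
) -> float:
    """Blended cost per call: in * in_price + out * out_price, rounded to 6 decimals."""
    cost = in_tokens * in_price_per_token + out_tokens * out_price_per_token
    return round(cost, 6)


if __name__ == "__main__":
    # Typical call shape reported for this task.
    IN_TOKENS, OUT_TOKENS = 15277, 4830

    # Hypothetical prices, expressed per token (NOT taken from the page).
    in_price = 0.10 / 1_000_000
    out_price = 0.40 / 1_000_000

    print(f"${blended_cost(IN_TOKENS, OUT_TOKENS, in_price, out_price):.6f}")
```

Because the typical call here has roughly three input tokens for every output token, a model's ranking by blended cost depends on both prices rather than on either one alone.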