Best LLMs for region_identification autogenerated — DTP Benchmark
Models
Frontier on this task: Kimi K2.6 at 9.25 / 10. Quality bar at 95% of the frontier score: 8.79.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
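The qualification rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: it assumes the bar is 95% of the frontier point estimate (which matches 0.95 × 9.25 ≈ 8.79) and that a model qualifies when its CI low clears the bar. The function names `quality_bar` and `classify` are hypothetical, and the confidence-tier condition (MEDIUM+) is not modeled.

```python
def quality_bar(frontier_score: float, fraction: float = 0.95) -> float:
    """Bar as a fraction of the frontier point estimate (assumed definition)."""
    return frontier_score * fraction


def classify(point: float, ci_low: float, bar: float) -> str:
    """Label a model relative to the bar.

    'qualifies' : CI low clears the bar
    'greyed'    : point estimate clears the bar, CI low does not
    'below bar' : point estimate does not clear the bar
    """
    if ci_low >= bar:
        return "qualifies"
    if point >= bar:
        return "greyed"
    return "below bar"


bar = quality_bar(9.25)            # frontier score reported on this page
print(round(bar, 2))               # 8.79

# DeepSeek V4 Pro, values from the cost breakdown table on this page
print(classify(9.10, 8.80, bar))   # qualifies

# Hypothetical model, invented only to show the greyed case
print(classify(8.90, 8.60, bar))   # greyed
```

Whether the page compares against the rounded bar (8.79) or the unrounded product is not stated; the sketch uses the unrounded value.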
Cost breakdown
| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Kimi K2.6 (best) Moonshot AI | 9.25 / 10 CI [9.07, 9.44] | n=33 · ranked | $0.004088 | (anchor) | batch |
| DeepSeek V4 Pro DeepSeek | 9.10 / 10 CI [8.80, 9.40] | n=27 · high | $0.006699 | — | sync |
| Claude Sonnet 4.6 Anthropic | 9.22 / 10 CI [8.95, 9.49] | n=25 · high | $0.012538 | — | batch |
| Claude Opus 4.7 Anthropic | 9.15 / 10 CI [8.98, 9.32] | n=25 · ranked | $0.020898 | — | batch |
Typical call shape for this task: 844 input tokens → 1503 output tokens, EMA-tracked from production traffic. Blended cost = in × in_price + out × out_price, where in and out are the input and output token counts of that call shape, rounded to 6 decimals.
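As a rough worked example of the blended-cost formula, the sketch below applies it to the typical call shape. The page does not list per-token prices, so the prices here are hypothetical placeholders, and quoting them per million tokens is also an assumption. The function name `blended_cost` is illustrative, and any batch versus sync discount is not modeled.

```python
def blended_cost(
    in_tokens: int,
    out_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Cost of one call in dollars, rounded to 6 decimals.

    Prices are assumed to be quoted in dollars per million tokens.
    """
    total = in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok
    return round(total / 1_000_000, 6)


# Typical call shape from this page: 844 input tokens, 1503 output tokens.
# The prices are HYPOTHETICAL: $1.00 per million input, $2.00 per million output.
cost = blended_cost(844, 1503, 1.00, 2.00)
print(f"${cost:.6f}")   # $0.003850
```

Because output tokens outnumber input tokens by almost two to one in this call shape, the output price dominates the blended figure.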