Best LLMs for author_living_check autogenerated — DTP Benchmark
Models
Frontier on this task: DeepSeek V4 Pro at 9.31 / 10. Quality bar at 95%: 8.85.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
Cost breakdown
| Model | Provider | Quality (/ 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | 8.87 | [8.63, 9.11] | n=76 · high | $0.000688 | 92% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.88 | [8.64, 9.12] | n=89 · high | $0.007413 | 13% cheaper | batch |
| DeepSeek V4 Pro (best) | DeepSeek | 9.31 | [9.22, 9.41] | n=71 · ranked | $0.008554 | (anchor) | sync |
| Gemini 3.1 Pro Preview | Gemini | 8.87 | [8.67, 9.07] | n=67 · ranked | $0.020280 | - | batch |
| GPT-5.5 | OpenAI | 8.95 | [8.66, 9.24] | n=64 · high | $0.050700 | - | batch |
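
The "Savings vs best" column and the quality-bar check can be reproduced from the figures in the table. The sketch below is illustrative only. It assumes savings are computed as 1 − cost / anchor cost and rounded to a whole percent, and that a row "clears" the bar when its CI low is at or above 8.85. The field names, the rounding, and the status labels are assumptions rather than the benchmark's actual implementation, and the MEDIUM+ sample condition is not modelled.

```python
# Figures copied from the cost breakdown table; field names are hypothetical.
ROWS = [
    {"model": "DeepSeek V4 Flash", "quality": 8.87, "ci_low": 8.63, "cost": 0.000688},
    {"model": "Kimi K2.6", "quality": 8.88, "ci_low": 8.64, "cost": 0.007413},
    {"model": "DeepSeek V4 Pro", "quality": 9.31, "ci_low": 9.22, "cost": 0.008554},
    {"model": "Gemini 3.1 Pro Preview", "quality": 8.87, "ci_low": 8.67, "cost": 0.020280},
    {"model": "GPT-5.5", "quality": 8.95, "ci_low": 8.66, "cost": 0.050700},
]

QUALITY_BAR = 8.85          # as stated on the page
ANCHOR = "DeepSeek V4 Pro"  # the "best" row that savings are measured against


def savings_vs_anchor(cost, anchor_cost):
    """Whole-percent saving relative to the anchor, or None if not cheaper."""
    if cost >= anchor_cost:
        return None
    return round((1 - cost / anchor_cost) * 100)


def bar_status(row, bar=QUALITY_BAR):
    """Classify a row against the quality bar (assumed rule, see lead-in)."""
    if row["ci_low"] >= bar:
        return "clears on CI low"
    if row["quality"] >= bar:
        return "clears on point estimate only"
    return "below bar"


anchor_cost = next(r["cost"] for r in ROWS if r["model"] == ANCHOR)
SAVINGS = {r["model"]: savings_vs_anchor(r["cost"], anchor_cost) for r in ROWS}
STATUS = {r["model"]: bar_status(r) for r in ROWS}

if __name__ == "__main__":
    # Print rows cheapest first, matching the chart's sort order.
    for r in sorted(ROWS, key=lambda r: r["cost"]):
        print(r["model"], SAVINGS[r["model"]], STATUS[r["model"]])
```

Under these assumptions, only the anchor row clears the bar on its CI low; the other four clear it on the point estimate alone.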
Typical call shape for this task: 2304 input tokens → 1306 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost per call = in × in_price + out × out_price, where in and out are the input and output token counts, rounded to 6 decimals.
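
As a rough illustration of the blended-cost formula, the sketch below applies it to the typical call shape. The page does not list per-model token prices, so the prices used here ($1.00 and $2.00 per million tokens) are placeholders chosen for the example, and the function and parameter names are hypothetical.

```python
def blended_cost(in_tokens, out_tokens, in_price_per_mtok, out_price_per_mtok):
    """Per-call cost in dollars: in * in_price + out * out_price, rounded to 6 decimals.

    Prices are given per million tokens (an assumption about units),
    so the weighted sum is divided by 1,000,000.
    """
    cost = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1_000_000
    return round(cost, 6)


# Typical call shape from the page; the two prices are placeholders, not real rates.
EXAMPLE_COST = blended_cost(2304, 1306, in_price_per_mtok=1.00, out_price_per_mtok=2.00)

if __name__ == "__main__":
    print(f"${EXAMPLE_COST:.6f}")
```

Because a call produces roughly 1306 output tokens for every 2304 input tokens, the output price carries a little over half the weight of the input price in the blended figure.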