Best LLMs for Author Matching — DTP Benchmark
Matches content to fictional authors or creates new author personas
Models
Frontier on this task: DeepSeek V4 Pro at 8.94 / 10. Quality bar (95% of the frontier score): 8.49.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
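The qualification rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: it assumes the bar is 95% of the frontier score (8.94 × 0.95 ≈ 8.49, which matches the stated figure), and the names `quality_bar` and `classify` are hypothetical. The MEDIUM+ tier filter is not modeled because the page does not define it. Scores and CI lows are copied from the cost breakdown table.

```python
# Sketch of the quality-bar check, under the assumptions stated in the lead-in.
FRONTIER_SCORE = 8.94   # DeepSeek V4 Pro, from the page
BAR_FRACTION = 0.95     # assumption: "Quality bar at 95%" means 95% of the frontier

# (model, point estimate, CI low), copied from the cost breakdown table
MODELS = [
    ("DeepSeek V4 Flash", 8.63, 8.38),
    ("Kimi K2.6", 8.81, 8.64),
    ("Gemini 3.1 Pro Preview", 8.59, 8.42),
    ("DeepSeek V4 Pro", 8.94, 8.74),
    ("GPT-5.5", 8.57, 8.27),
]


def quality_bar(frontier=FRONTIER_SCORE, fraction=BAR_FRACTION):
    """Return the minimum score a model must reach to count as a qualifier."""
    return frontier * fraction


def classify(point, ci_low, bar):
    """Label a model by how its score and CI low compare to the bar."""
    if ci_low >= bar:
        return "qualifies"   # even the pessimistic estimate clears the bar
    if point >= bar:
        return "greyed"      # point estimate clears the bar, CI low does not
    return "below bar"


if __name__ == "__main__":
    bar = quality_bar()
    print(f"quality bar: {bar:.2f}")
    for name, point, ci_low in MODELS:
        print(f"{name}: {classify(point, ci_low, bar)}")
```

Comparing against the CI low rather than the point estimate is the conservative choice: a model only counts as a full qualifier when the lower end of its interval still clears the bar.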
Cost breakdown
| Model | Provider | Quality (/10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | 8.63 | [8.38, 8.89] | n=100 · ranked | $0.002178 | 92% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.81 | [8.64, 8.99] | n=100 · ranked | $0.008902 | 67% cheaper | batch |
| Gemini 3.1 Pro Preview | Gemini | 8.59 | [8.42, 8.77] | n=100 · ranked | $0.015670 | 42% cheaper | batch |
| DeepSeek V4 Pro (best) | DeepSeek | 8.94 | [8.74, 9.14] | n=100 · ranked | $0.027064 | (anchor) | sync |
| GPT-5.5 | OpenAI | 8.57 | [8.27, 8.87] | n=100 · ranked | $0.039175 | none | batch |
Typical call shape for this task: 15,496 input tokens → 29 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = in × in_price + out × out_price, where in and out are the input and output token counts, rounded to 6 decimals.
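The blended-cost formula and the "Savings vs best" column can be reproduced with a short sketch. The function names are hypothetical, and the per-token prices in the example are placeholders chosen for illustration, not any provider's actual pricing (the page does not list prices). The formula is assumed to take prices per token. Savings are assumed to be computed relative to the anchor row (DeepSeek V4 Pro), which agrees with the published percentages.

```python
# Sketch of the cost arithmetic, under the assumptions stated in the lead-in.
IN_TOKENS = 15496   # typical input tokens per call, from the page
OUT_TOKENS = 29     # typical output tokens per call, from the page
ANCHOR_COST = 0.027064  # blended cost of DeepSeek V4 Pro, from the table


def blended_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost of one call in dollars; prices are per token (assumption)."""
    return round(in_tokens * in_price + out_tokens * out_price, 6)


def savings_vs_anchor(cost, anchor_cost=ANCHOR_COST):
    """Percentage saved relative to the anchor; negative if costlier."""
    return (1 - cost / anchor_cost) * 100


if __name__ == "__main__":
    # Placeholder prices: $1 per million input tokens, $2 per million output.
    example = blended_cost(IN_TOKENS, OUT_TOKENS, 1e-6, 2e-6)
    print(f"example blended cost: ${example:.6f}")

    # Blended costs copied from the table.
    for name, cost in [
        ("DeepSeek V4 Flash", 0.002178),
        ("Kimi K2.6", 0.008902),
        ("Gemini 3.1 Pro Preview", 0.015670),
    ]:
        print(f"{name}: {savings_vs_anchor(cost):.0f}% cheaper")
```

With roughly 15,500 input tokens against 29 output tokens, the input price dominates the blended cost almost entirely, so models are effectively ranked by input pricing for this task.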