Best LLMs for author_soul_generation autogenerated — DTP Benchmark
Models
Frontier on this task: Claude Opus 4.7 at 9.53 / 10. Quality bar at 95% of the frontier score: 9.05.
Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
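The qualification rule described above can be sketched in a few lines. This is one reading of the legend, not the site's actual ranking code: it assumes the bar is 95% of the frontier score rounded to two decimals, that a model qualifies when its CI low is at or above the bar, and it ignores the MEDIUM+ restriction. The `Row` type and variable names are hypothetical; the figures are copied from the cost table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower bound of the confidence interval
    cost: float     # blended cost per call, USD


FRONTIER = 9.53
# Assumption: the quality bar is 95% of the frontier score, to 2 decimals.
BAR = round(0.95 * FRONTIER, 2)

ROWS = [
    Row("DeepSeek V4 Flash", 9.26, 9.18, 0.001420),
    Row("DeepSeek V4 Pro", 9.16, 9.06, 0.017647),
    Row("Haiku 4.5", 9.14, 9.03, 0.023960),
    Row("Claude Sonnet 4.6", 9.18, 9.12, 0.071880),
    Row("Claude Opus 4.7", 9.53, 9.49, 0.119800),
    Row("GPT-5.5", 9.26, 9.04, 0.142830),
]

# Qualifiers: CI low clears the bar; listed cheapest first.
qualifiers = sorted((r for r in ROWS if r.ci_low >= BAR), key=lambda r: r.cost)

# Borderline: point estimate clears the bar but CI low does not.
borderline = [r for r in ROWS if r.quality >= BAR > r.ci_low]

print("bar:", BAR)
print("qualifiers:", [r.model for r in qualifiers])
print("borderline:", [r.model for r in borderline])
```

Under these assumptions the cheapest qualifier is DeepSeek V4 Flash, while Haiku 4.5 and GPT-5.5 fall into the borderline group because their CI lows (9.03 and 9.04) sit just under the bar.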
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | 9.26 / 10 CI [9.18, 9.33] | n=97 · ranked | $0.001420 | 99% cheaper | sync |
| DeepSeek V4 Pro | DeepSeek | 9.16 / 10 CI [9.06, 9.26] | n=97 · ranked | $0.017647 | 85% cheaper | sync |
| Haiku 4.5 | Anthropic | 9.14 / 10 CI [9.03, 9.25] | n=84 · ranked | $0.023960 | 80% cheaper | batch |
| Claude Sonnet 4.6 | Anthropic | 9.18 / 10 CI [9.12, 9.25] | n=84 · ranked | $0.071880 | 40% cheaper | batch |
| Claude Opus 4.7 (best) | Anthropic | 9.53 / 10 CI [9.49, 9.57] | n=84 · ranked | $0.119800 | (anchor) | batch |
| GPT-5.5 | OpenAI | 9.26 / 10 CI [9.04, 9.48] | n=93 · high | $0.142830 | — | batch |
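The "Savings vs best" column appears to be the relative cost reduction against the anchor row, rounded to a whole percent. The sketch below is a reconstruction under that assumption, not the benchmark's own code; `savings_vs_anchor` is a hypothetical helper, and returning `None` for models that cost more than the anchor is a guess at why that cell is shown as a dash.

```python
from typing import Optional

ANCHOR_COST = 0.119800  # blended cost per call of the anchor row, USD


def savings_vs_anchor(cost: float, anchor_cost: float = ANCHOR_COST) -> Optional[int]:
    """Percent saved relative to the anchor, or None if the model costs more."""
    if cost > anchor_cost:
        return None  # assumption: pricier-than-anchor rows show no savings figure
    return round((1 - cost / anchor_cost) * 100)


costs = {
    "DeepSeek V4 Flash": 0.001420,
    "DeepSeek V4 Pro": 0.017647,
    "Haiku 4.5": 0.023960,
    "Claude Sonnet 4.6": 0.071880,
    "GPT-5.5": 0.142830,
}

for model, cost in costs.items():
    pct = savings_vs_anchor(cost)
    label = "no savings" if pct is None else f"{pct}% cheaper"
    print(f"{model}: {label}")
```

With the costs listed in the table, this reproduces the 99%, 85%, 80%, and 40% figures.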
Typical call shape for this task: 930 input tokens → 4606 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = in × in_price + out × out_price, where in and out are the token counts and each price is per token, rounded to 6 decimals.
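As a worked example of the blended-cost formula, the sketch below plugs in the typical call shape. The prices are assumptions and are not stated on this page: $5 per million input tokens and $25 per million output tokens were chosen only because they reproduce the $0.119800 shown for the anchor row. `blended_cost` is a hypothetical helper, and any batch-mode discount is not modelled.

```python
def blended_cost(
    tokens_in: int,
    tokens_out: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Blended cost per call in USD, rounded to 6 decimals.

    Prices are given per million tokens, so the sum is divided by 1e6
    to get the per-token form used in the formula above.
    """
    usd = (tokens_in * in_price_per_mtok + tokens_out * out_price_per_mtok) / 1_000_000
    return round(usd, 6)


# Typical call shape from this page: 930 input tokens, 4606 output tokens.
# Assumed (illustrative) prices: $5 / $25 per million input / output tokens.
cost = blended_cost(930, 4606, in_price_per_mtok=5.0, out_price_per_mtok=25.0)
print(f"${cost:.6f}")
```

Because the call shape is output-heavy (roughly five output tokens per input token), the output price dominates the blended figure.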