Best LLMs for section_generation (autogenerated) | DTP Benchmark
Models
Frontier on this task: Qwen 3.6 Plus at 9.09 / 10. Quality bar at 95%: 8.64.
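The quality bar above is consistent with taking 95% of the frontier score and rounding to two decimals. The sketch below shows that arithmetic. It assumes "at 95%" means a fraction of the frontier point estimate, which the page does not state outright, and the names `FRONTIER_SCORE`, `BAR_FRACTION`, and `quality_bar` are illustrative.

```python
# Assumption: the quality bar is a fixed fraction of the frontier score.
FRONTIER_SCORE = 9.09   # Qwen 3.6 Plus, as reported on this page
BAR_FRACTION = 0.95     # "Quality bar at 95%"


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Return the quality bar, rounded to two decimals as displayed."""
    return round(frontier * fraction, 2)


print(quality_bar(FRONTIER_SCORE))  # 8.64
```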
Legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
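The qualification rule can be sketched as a small classifier. The figures are copied from this page's cost breakdown table. The `Row` type, the `status` helper, and the reading of "MEDIUM+" as the tiers medium, high, and ranked are assumptions made for illustration, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.64  # quality bar stated on this page
MEDIUM_PLUS = {"medium", "high", "ranked"}  # assumed meaning of "MEDIUM+"


@dataclass(frozen=True)
class Row:
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower bound of the reported CI
    cost: float     # blended cost per call, USD
    tier: str       # sample label shown next to n


ROWS = [
    Row("Qwen 3.6 Plus", 9.09, 9.00, 0.004831, "ranked"),
    Row("Kimi K2.6", 9.05, 8.92, 0.006060, "ranked"),
    Row("DeepSeek V4 Pro", 8.81, 8.51, 0.009403, "high"),
    Row("Claude Sonnet 4.6", 8.79, 8.30, 0.018748, "medium"),
    Row("Claude Opus 4.7", 9.06, 8.81, 0.031248, "high"),
    Row("GPT-5.5", 8.91, 8.73, 0.037160, "ranked"),
]


def status(row: Row, bar: float = QUALITY_BAR) -> str:
    """Classify a row against the quality bar."""
    if row.ci_low >= bar:
        return "qualifies"  # even the CI floor clears the bar
    if row.quality >= bar and row.tier in MEDIUM_PLUS:
        return "greyed"     # point estimate clears, CI floor does not
    return "below bar"


# Qualifiers sorted by blended cost, cheapest first.
qualifiers = sorted(
    (r for r in ROWS if status(r) == "qualifies"), key=lambda r: r.cost
)
greyed = [r.model for r in ROWS if status(r) == "greyed"]

print([r.model for r in qualifiers])
print(greyed)
```

Under these assumptions, DeepSeek V4 Pro and Claude Sonnet 4.6 land in the greyed group, since their CI lows (8.51 and 8.30) fall under 8.64 while their point estimates clear it.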
Cost breakdown
| Model | Provider | Quality (/ 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| Qwen 3.6 Plus (best) | Alibaba Cloud (DashScope) | 9.09 | [9.00, 9.19] | n=8 · ranked | $0.004831 | (anchor) | sync |
| Kimi K2.6 | Moonshot AI | 9.05 | [8.92, 9.19] | n=8 · ranked | $0.006060 | — | batch |
| DeepSeek V4 Pro | DeepSeek | 8.81 | [8.51, 9.11] | n=7 · high | $0.009403 | — | sync |
| Claude Sonnet 4.6 | Anthropic | 8.79 | [8.30, 9.27] | n=6 · medium | $0.018748 | — | batch |
| Claude Opus 4.7 | Anthropic | 9.06 | [8.81, 9.32] | n=6 · high | $0.031248 | — | batch |
| GPT-5.5 | OpenAI | 8.91 | [8.73, 9.10] | n=6 · ranked | $0.037160 | — | batch |
Typical call shape for this task: 674 input tokens → 2365 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
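A minimal sketch of the blended cost formula, applied to the typical call shape of 674 input and 2365 output tokens. The per-token prices in the example are hypothetical placeholders, not any listed model's actual rates, and `blended_cost` is an illustrative name.

```python
def blended_cost(tokens_in: int, tokens_out: int,
                 in_price: float, out_price: float) -> float:
    """Per-call cost in USD, rounded to 6 decimals.

    in_price and out_price are USD per single token.
    """
    return round(tokens_in * in_price + tokens_out * out_price, 6)


PER_MILLION = 1e-6  # converts a USD-per-million-tokens price to USD per token

# Hypothetical prices: $1.00 per million input tokens, $2.00 per million output.
example = blended_cost(674, 2365, 1.00 * PER_MILLION, 2.00 * PER_MILLION)
print(f"${example:.6f}")  # $0.005404
```

Because output tokens outnumber input tokens by roughly 3.5 to 1 in this call shape, the output price dominates the blended cost.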