Best LLMs for topic_clustering_assign_sections autogenerated — DTP Benchmark
Models
Frontier on this task: GPT-5.5 at 8.73 / 10. Quality bar at 95%: 8.30.
Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
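The greying rule in the legend can be sketched as a small classifier. This is a minimal illustration, not the benchmark's own code: it assumes a model "qualifies" when its CI low is at or above the 8.30 bar, and it ignores the MEDIUM+ sample-tier condition, whose exact definition is not given on this page. The `Row` type and `classify` function are hypothetical names; the scores come from the cost breakdown table on this page.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.30  # "Quality bar at 95%" as stated on this page


@dataclass
class Row:
    model: str
    score: float   # point estimate, out of 10
    ci_low: float  # lower bound of the confidence interval


def classify(row: Row, bar: float = QUALITY_BAR) -> str:
    """Assumed rule: qualify on CI low; grey out if only the point estimate clears."""
    if row.ci_low >= bar:
        return "qualifies"
    if row.score >= bar:
        return "greyed"
    return "below bar"


rows = [
    Row("Qwen 3.5 Flash", 8.65, 8.41),
    Row("DeepSeek V4 Flash", 8.40, 8.15),
    Row("Qwen 3.6 Plus", 8.73, 8.46),
    Row("Kimi K2.6", 8.64, 8.33),
    Row("Gemini 3.1 Pro Preview", 8.61, 8.32),
    Row("GPT-5.5", 8.73, 8.50),
]

for r in rows:
    print(f"{r.model}: {classify(r)}")
```

Under these assumptions, DeepSeek V4 Flash is the only row whose point estimate (8.40) clears the bar while its CI low (8.15) does not.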
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 8.65 / 10 CI [8.41, 8.89] | n=100 · ranked | $0.000112 | 99% cheaper | sync |
| DeepSeek V4 Flash | DeepSeek | 8.40 / 10 CI [8.15, 8.65] | n=100 · high | $0.000462 | 95% cheaper | sync |
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.73 / 10 CI [8.46, 9.00] | n=100 · high | $0.001155 | 87% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.64 / 10 CI [8.33, 8.95] | n=100 · high | $0.001960 | 78% cheaper | batch |
| Gemini 3.1 Pro Preview | Gemini | 8.61 / 10 CI [8.32, 8.90] | n=99 · high | $0.003554 | 60% cheaper | batch |
| GPT-5.5 (best) | OpenAI | 8.73 / 10 CI [8.50, 8.97] | n=100 · high | $0.008885 | (anchor) | batch |
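Typical call shape for this task: 3170 input tokens → 64 output tokens, EMA-tracked from production traffic. Blended cost per call = in × in_price + out × out_price, where in and out are the input and output token counts and the prices are per token; the result is rounded to 6 decimals.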
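The blended-cost formula and the "Savings vs best" column can be reproduced with a short sketch. The per-token prices used below are hypothetical placeholders, since this page does not list any provider's actual prices. The savings formula (1 minus the ratio to the anchor's cost, rounded to a whole percent) is an assumption; it matches the five percentages in the table when checked against the listed costs.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost of one call in dollars, given per-token prices, rounded to 6 decimals."""
    return round(in_tokens * in_price + out_tokens * out_price, 6)


def savings_pct(cost: float, anchor_cost: float) -> int:
    """Assumed definition: percent cheaper than the anchor (best) model."""
    return round(100 * (1 - cost / anchor_cost))


# Typical call shape from this page.
IN_TOKENS, OUT_TOKENS = 3170, 64

# Hypothetical prices: $0.03 per 1M input tokens, $0.30 per 1M output tokens.
example = blended_cost(IN_TOKENS, OUT_TOKENS, 0.03 / 1e6, 0.30 / 1e6)
print(f"example blended cost: ${example:.6f}")

# Savings recomputed from the costs listed in the table (anchor: GPT-5.5).
ANCHOR = 0.008885
for name, cost in [("Qwen 3.5 Flash", 0.000112),
                   ("DeepSeek V4 Flash", 0.000462),
                   ("Qwen 3.6 Plus", 0.001155),
                   ("Kimi K2.6", 0.001960),
                   ("Gemini 3.1 Pro Preview", 0.003554)]:
    print(f"{name}: {savings_pct(cost, ANCHOR)}% cheaper")
```

Because the call shape is dominated by input tokens (3170 in versus 64 out), the input price drives most of the blended cost for this task.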