Best LLMs for topic_cluster_naming autogenerated — DTP Benchmark
Models
Frontier on this task: GPT-5.5 at 8.59 / 10. Quality bar (95% of the frontier score): 8.16.
Chart legend: point-estimate floor (CI low) · upper CI (less certain).
Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
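The qualification rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bar is taken as 95% of the frontier score rounded to two decimals, "clears" is read as greater than or equal, and the MEDIUM+ tier condition is ignored because the page does not define it. The names `Row`, `quality_bar`, and `classify` are hypothetical; the scores, CI lows, and costs are copied from the cost table on this page.

```python
from dataclasses import dataclass

FRONTIER_SCORE = 8.59  # GPT-5.5 point estimate, as stated on the page
BAR_FRACTION = 0.95    # "Quality bar at 95%"


@dataclass(frozen=True)
class Row:
    model: str
    score: float   # point estimate, out of 10
    ci_low: float  # lower end of the reported CI
    cost: float    # blended cost per call, USD


ROWS = [
    Row("Qwen 3.6 Plus", 8.20, 7.99, 0.008968),
    Row("Haiku 4.5", 8.22, 7.99, 0.013708),
    Row("Kimi K2.6", 8.50, 8.35, 0.015546),
    Row("Claude Sonnet 4.6", 8.20, 8.02, 0.041122),
    Row("DeepSeek V4 Pro", 8.27, 8.09, 0.046773),
    Row("GPT-5.5", 8.59, 8.44, 0.068982),
]


def quality_bar(frontier: float, fraction: float) -> float:
    """Assumed definition: a fixed fraction of the frontier score, 2 decimals."""
    return round(frontier * fraction, 2)


def classify(row: Row, bar: float) -> str:
    """Assumed reading of the legend; the MEDIUM+ tier check is omitted."""
    if row.ci_low >= bar:
        return "qualifies"  # even the CI low clears the bar
    if row.score >= bar:
        return "greyed"     # point estimate clears, CI low does not
    return "below bar"


BAR = quality_bar(FRONTIER_SCORE, BAR_FRACTION)
STATUS = {row.model: classify(row, BAR) for row in ROWS}

# Print cheapest first, mirroring the chart's sort order.
for row in sorted(ROWS, key=lambda r: r.cost):
    print(f"{row.model:<20} ${row.cost:.6f}  {STATUS[row.model]}")
```

Under these assumptions, only Kimi K2.6 and GPT-5.5 have a CI low at or above the bar; the other four rows clear it on the point estimate alone.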
Cost breakdown
| Model | Provider | Quality (out of 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.20 | [7.99, 8.41] | n=100 · ranked | $0.008968 | 87% cheaper | sync |
| Haiku 4.5 | Anthropic | 8.22 | [7.99, 8.46] | n=72 · high | $0.013708 | 80% cheaper | batch |
| Kimi K2.6 | Moonshot AI | 8.50 | [8.35, 8.65] | n=100 · ranked | $0.015546 | 77% cheaper | batch |
| Claude Sonnet 4.6 | Anthropic | 8.20 | [8.02, 8.38] | n=94 · ranked | $0.041122 | 40% cheaper | batch |
| DeepSeek V4 Pro | DeepSeek | 8.27 | [8.09, 8.45] | n=77 · ranked | $0.046773 | 32% cheaper | sync |
| GPT-5.5 (best) | OpenAI | 8.59 | [8.44, 8.74] | n=89 · ranked | $0.068982 | (anchor) | batch |
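The "Savings vs best" column appears to be each model's blended cost relative to the anchor row (GPT-5.5). A minimal sketch, assuming the percentage is `1 - cost / anchor_cost` rounded to the nearest whole percent; that rounding rule is an assumption, though it reproduces the figures shown in the table. The function name `savings_pct` is hypothetical.

```python
ANCHOR_COST = 0.068982  # GPT-5.5 blended cost per call, from the table

# Blended cost per call in USD, copied from the table.
COSTS = {
    "Qwen 3.6 Plus": 0.008968,
    "Haiku 4.5": 0.013708,
    "Kimi K2.6": 0.015546,
    "Claude Sonnet 4.6": 0.041122,
    "DeepSeek V4 Pro": 0.046773,
}


def savings_pct(cost: float, anchor_cost: float) -> int:
    """Percent cheaper than the anchor, rounded to a whole percent (assumed)."""
    return round((1 - cost / anchor_cost) * 100)


SAVINGS = {model: savings_pct(cost, ANCHOR_COST) for model, cost in COSTS.items()}

for model, pct in SAVINGS.items():
    print(f"{model:<20} {pct}% cheaper")
```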
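Typical call shape for this task: 26,525 input tokens → 178 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = (in × in_price + out × out_price), where in and out are the input and output token counts and in_price and out_price are the corresponding prices, rounded to 6 decimals.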
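The blended-cost formula can be worked through for the typical call shape. The page does not list per-model prices, so the prices below ($1.00 per million input tokens, $5.00 per million output tokens) are purely hypothetical placeholders, and quoting prices per million tokens is also an assumption. The function name `blended_cost` is illustrative.

```python
def blended_cost(
    in_tokens: int,
    out_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """in x in_price + out x out_price, rounded to 6 decimals.

    Prices are assumed to be USD per million tokens.
    """
    total = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1_000_000
    return round(total, 6)


# Typical call shape from the page; the prices are hypothetical placeholders.
EXAMPLE_COST = blended_cost(
    in_tokens=26525,
    out_tokens=178,
    in_price_per_mtok=1.00,
    out_price_per_mtok=5.00,
)
print(f"${EXAMPLE_COST:.6f}")
```

With roughly 150 input tokens for every output token, the input price dominates the blended cost for this task, even when the output price per token is several times higher.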