Best LLMs for activity_promo_generation (autogenerated) - DTP Benchmark
Models
Frontier on this task: Claude Opus 4.7 at 8.65 / 10. Quality bar at 95%: 8.21.
Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
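
The qualification rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the `Row` type, the `status` function, and the use of `>=` at the boundary are assumptions, and the sketch ignores the sample-tier (MEDIUM+) condition because the page does not define the tier ordering. The scores, CI lows, and costs are copied from the cost breakdown table.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.21  # quality bar as stated on this page


@dataclass(frozen=True)
class Row:
    model: str
    score: float   # point estimate, out of 10
    ci_low: float  # lower bound of the confidence interval
    cost: float    # blended cost per call, USD


def status(row: Row, bar: float = QUALITY_BAR) -> str:
    """Classify a row against the quality bar (hypothetical helper)."""
    if row.ci_low >= bar:   # assumption: a CI low equal to the bar still qualifies
        return "qualifies"
    if row.score >= bar:    # point estimate clears the bar, CI low does not
        return "greyed"
    return "below bar"


rows = [
    Row("Claude Opus 4.7", 8.65, 8.33, 0.011505),
    Row("Claude Sonnet 4.6", 8.50, 8.16, 0.006903),
    Row("Gemini 3.1 Pro Preview", 8.33, 8.18, 0.005008),
    Row("Qwen 3.5 Flash", 8.46, 8.22, 0.000091),
]

# Qualifiers sorted by blended cost, cheapest first.
qualifiers = sorted(
    (r for r in rows if status(r) == "qualifies"),
    key=lambda r: r.cost,
)

for r in rows:
    print(f"{r.model}: {status(r)}")
print("cheapest qualifier:", qualifiers[0].model)
```

Under these assumptions, Qwen 3.5 Flash is the cheapest qualifier because its CI low (8.22) sits just above the bar, while Claude Sonnet 4.6 and Gemini 3.1 Pro Preview clear the bar only on their point estimates.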
Cost breakdown
| Model | Provider | Quality (/ 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 8.46 | [8.22, 8.70] | n=85 · high | $0.000091 | 99% cheaper | sync |
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.48 | [8.33, 8.64] | n=85 · ranked | $0.000814 | 93% cheaper | sync |
| GPT-5.4 mini | OpenAI | 8.62 | [8.46, 8.77] | n=64 · ranked | $0.001878 | 84% cheaper | batch |
| Kimi K2.6 | Moonshot AI | 8.64 | [8.52, 8.77] | n=85 · ranked | $0.002034 | 82% cheaper | batch |
| DeepSeek V4 Pro | DeepSeek | 8.58 | [8.29, 8.86] | n=67 · high | $0.002944 | 74% cheaper | sync |
| Gemini 3.1 Pro Preview | Gemini | 8.33 | [8.18, 8.48] | n=61 · ranked | $0.005008 | 56% cheaper | batch |
| Claude Sonnet 4.6 | Anthropic | 8.50 | [8.16, 8.84] | n=63 · medium | $0.006903 | 40% cheaper | batch |
| Claude Opus 4.7 (best) | Anthropic | 8.65 | [8.33, 8.96] | n=63 · medium | $0.011505 | (anchor) | batch |
| GPT-5.5 | OpenAI | 8.64 | [8.43, 8.86] | n=64 · high | $0.012520 | — | batch |
Typical call shape for this task: 1286 input tokens → 203 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
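
The blended-cost formula and the "Savings vs best" column can be reproduced with a short sketch. The function names and the per-million-token price units are assumptions made for illustration, and the prices in the first example are placeholders, since the page does not list per-token prices. The savings calculation, rounded to the nearest whole percent against the anchor's blended cost, appears consistent with every percentage in the table.

```python
IN_TOKENS = 1286   # typical input tokens per call (from this page)
OUT_TOKENS = 203   # typical output tokens per call (from this page)


def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Blended cost of one call in USD, rounded to 6 decimals.

    Prices are in USD per million tokens (an assumed unit).
    """
    total = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1_000_000
    return round(total, 6)


def savings_vs_anchor(cost: float, anchor_cost: float) -> int:
    """Whole-percent savings relative to the anchor model's blended cost."""
    return round((1 - cost / anchor_cost) * 100)


# Placeholder prices, NOT taken from the page: $1.00 / M input, $4.00 / M output.
example = blended_cost(IN_TOKENS, OUT_TOKENS, 1.00, 4.00)
print(f"example blended cost: ${example:.6f}")

# Savings recomputed from the table's blended costs (anchor: Claude Opus 4.7).
ANCHOR_COST = 0.011505
print(savings_vs_anchor(0.000091, ANCHOR_COST), "% cheaper (Qwen 3.5 Flash)")
print(savings_vs_anchor(0.005008, ANCHOR_COST), "% cheaper (Gemini 3.1 Pro Preview)")
```

Because input tokens outnumber output tokens by roughly six to one in this call shape, the input price dominates the blended cost unless the output price is several times higher.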