Best LLMs for Social Post Promo (pooled) — DTP Benchmark
Pooled TT for single-platform article promo posts (X, Bluesky, Mastodon, Threads). All platforms share the same prompt skeleton; each adds its own style addendum and character limit.
Models
Frontier on this task: DeepSeek V4 Pro at 8.23 / 10. Quality bar at 95% of the frontier score: 7.82.
Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
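The bar and the qualifier rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the names `Row`, `quality_bar`, and `classify` are hypothetical, "clears the bar" is assumed to mean greater than or equal to the bar, the bar is assumed to be rounded to two decimals before comparison, and the MEDIUM+ tier filter is left out. The scores are taken from the cost breakdown table.

```python
from dataclasses import dataclass

BAR_FRACTION = 0.95  # "Quality bar at 95%" from the text


@dataclass
class Row:
    model: str
    point: float   # point estimate, out of 10
    ci_low: float  # lower bound of the confidence interval


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Bar as a fraction of the frontier score, rounded to 2 decimals (assumed)."""
    return round(frontier * fraction, 2)


def classify(row: Row, bar: float) -> str:
    """Assumed rule: CI low >= bar qualifies; point >= bar only is greyed."""
    if row.ci_low >= bar:
        return "qualifies"
    if row.point >= bar:
        return "greyed"
    return "below bar"


# Point estimates and CI lows copied from the cost breakdown table.
rows = [
    Row("Qwen 3.6 Plus", 8.20, 8.03),
    Row("Kimi K2.6", 8.21, 7.99),
    Row("DeepSeek V4 Pro", 8.23, 7.98),
    Row("Claude Sonnet 4.6", 7.87, 7.63),
    Row("GPT-5.5", 7.94, 7.74),
]

bar = quality_bar(8.23)  # frontier score from the text
statuses = {r.model: classify(r, bar) for r in rows}

for model, status in statuses.items():
    print(f"{model}: {status}")
```

Under these assumptions the first three rows qualify, while Claude Sonnet 4.6 and GPT-5.5 are greyed because their CI lows (7.63 and 7.74) fall below the bar even though their point estimates clear it.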
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.20 / 10 CI [8.03, 8.37] | n=100 · ranked | $0.001489 | 76% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.21 / 10 CI [7.99, 8.44] | n=100 · high | $0.003924 | 37% cheaper | batch |
| DeepSeek V4 Pro (best) | DeepSeek | 8.23 / 10 CI [7.98, 8.48] | n=96 · high | $0.006217 | (anchor) | sync |
| Claude Sonnet 4.6 | Anthropic | 7.87 / 10 CI [7.63, 8.10] | n=94 · high | $0.012987 | — | batch |
| GPT-5.5 | OpenAI | 7.94 / 10 CI [7.74, 8.15] | n=100 · high | $0.022905 | — | batch |
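Typical call shape for this task: 3069 input tokens → 252 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = in × in_price + out × out_price, where in and out are the input and output token counts; the result is rounded to 6 decimals.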
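As a rough sketch of the arithmetic, the snippet below computes a blended cost for the typical call shape and reproduces the "Savings vs best" percentages from the table. Several things here are assumptions: the function names are hypothetical, prices are assumed to be quoted per million tokens (the page does not state the unit), the example prices are placeholders rather than any provider's real rates, and the savings formula is inferred from the table figures rather than documented.

```python
from typing import Optional


def blended_cost(tokens_in: int, tokens_out: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """in * in_price + out * out_price, rounded to 6 decimals.

    Assumption: prices are in dollars per million tokens.
    """
    cost = (tokens_in * in_price_per_mtok
            + tokens_out * out_price_per_mtok) / 1_000_000
    return round(cost, 6)


def savings_vs_anchor(cost: float, anchor_cost: float) -> Optional[int]:
    """Whole-percent savings relative to the anchor model's cost.

    Returns None when the model is not cheaper than the anchor,
    matching the table rows that show no savings figure.
    """
    if cost >= anchor_cost:
        return None
    return round((1 - cost / anchor_cost) * 100)


# Typical call shape from the text; the prices are hypothetical placeholders.
example_cost = blended_cost(3069, 252,
                            in_price_per_mtok=1.00, out_price_per_mtok=2.00)

# Blended costs copied from the table; DeepSeek V4 Pro is the anchor.
ANCHOR_COST = 0.006217
qwen_savings = savings_vs_anchor(0.001489, ANCHOR_COST)
kimi_savings = savings_vs_anchor(0.003924, ANCHOR_COST)
gpt_savings = savings_vs_anchor(0.022905, ANCHOR_COST)

print(example_cost, qwen_savings, kimi_savings, gpt_savings)
```

With the table's costs, this inferred formula gives 76 for Qwen 3.6 Plus and 37 for Kimi K2.6, which agrees with the published column. Because input tokens outnumber output tokens roughly 12 to 1 in this call shape, the input price dominates the blended figure unless the output price is much higher.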