Best LLMs for LLM Prompt Adaptation — DTP Benchmark
Models
Frontier on this task: Claude Sonnet 4.6 at 9.10 / 10. Quality bar, set at 95% of the frontier score: 8.64.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
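The qualification rule in the legend can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the `Row` class, the `status` function, and the use of `>=` (rather than `>`) at the bar are assumptions, and the MEDIUM+ tier filter is left out because the page does not define it. The numbers are copied from the cost breakdown table.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.64  # the "quality bar at 95%" value stated on the page


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    quality: float   # point estimate, out of 10
    ci_low: float    # lower bound of the reported CI
    ci_high: float   # upper bound of the reported CI
    cost: float      # blended cost per call, USD


ROWS = [
    Row("Qwen 3.6 Plus", 8.89, 8.79, 8.98, 0.009151),
    Row("Kimi K2.6", 8.73, 8.53, 8.92, 0.011599),
    Row("DeepSeek V4 Pro", 8.73, 8.48, 8.97, 0.018625),
    Row("Claude Sonnet 4.6", 9.10, 9.00, 9.19, 0.035690),
    Row("Claude Opus 4.7", 8.83, 8.67, 8.99, 0.059482),
]


def status(row: Row, bar: float = QUALITY_BAR) -> str:
    """Classify a row against the quality bar.

    Assumption: a row qualifies when its CI low is at or above the bar,
    and is greyed when only its point estimate clears the bar.
    """
    if row.ci_low >= bar:
        return "qualifies"
    if row.quality >= bar:
        return "greyed"
    return "below bar"


# Sort by blended cost and pick the cheapest row that qualifies.
ranked = sorted(ROWS, key=lambda r: r.cost)
cheapest_qualifier = next(r for r in ranked if status(r) == "qualifies")

for r in ranked:
    print(f"{r.model:<20} {status(r):<10} ${r.cost:.6f}")
```

Under these assumptions, Kimi K2.6 and DeepSeek V4 Pro come out greyed (CI lows of 8.53 and 8.48 fall under 8.64), and Qwen 3.6 Plus is the cheapest qualifier.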
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.89 / 10 CI [8.79, 8.98] | n=100 · ranked | $0.009151 | 74% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.73 / 10 CI [8.53, 8.92] | n=100 · ranked | $0.011599 | 68% cheaper | batch |
| DeepSeek V4 Pro | DeepSeek | 8.73 / 10 CI [8.48, 8.97] | n=100 · ranked | $0.018625 | 48% cheaper | sync |
| Claude Sonnet 4.6 (best) | Anthropic | 9.10 / 10 CI [9.00, 9.19] | n=100 · ranked | $0.035690 | (anchor) | batch |
| Claude Opus 4.7 | Anthropic | 8.83 / 10 CI [8.67, 8.99] | n=100 · ranked | $0.059482 | — | batch |
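Typical call shape for this task: 1978 input tokens → 4363 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals, where in and out are the token counts above and in_price and out_price are the model's per-token prices.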
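As a rough sketch of the arithmetic, the blended-cost formula can be written as below. The function names are hypothetical, and the per-token prices in the example are made up for illustration; the page does not list any model's actual prices. The `savings_vs_anchor` formula (one minus the cost ratio, rounded to a whole percent) is also an assumption, though it reproduces the "Savings vs best" figures in the table.

```python
IN_TOKENS = 1978    # typical input tokens per call, from the page
OUT_TOKENS = 4363   # typical output tokens per call, from the page


def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Blended cost per call in USD, with prices given in USD per token."""
    return round(in_tokens * in_price + out_tokens * out_price, 6)


def savings_vs_anchor(cost: float, anchor_cost: float) -> int:
    """Percent saved relative to the anchor model (assumed formula)."""
    return round(100 * (1 - cost / anchor_cost))


# Hypothetical prices: $1 per million input tokens, $5 per million output tokens.
example = blended_cost(IN_TOKENS, OUT_TOKENS, 1.0 / 1_000_000, 5.0 / 1_000_000)
print(f"example blended cost: ${example:.6f}")

# Savings recomputed from the table's blended costs, anchored on Claude Sonnet 4.6.
ANCHOR_COST = 0.035690
for name, cost in [("Qwen 3.6 Plus", 0.009151),
                   ("Kimi K2.6", 0.011599),
                   ("DeepSeek V4 Pro", 0.018625)]:
    print(f"{name}: {savings_vs_anchor(cost, ANCHOR_COST)}% cheaper")
```

Because output tokens outnumber input tokens by more than two to one in this call shape, the output price dominates the blended cost for most pricing schemes.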