Best LLMs for image_prompt_generation (autogenerated) - DTP Benchmark
Models
Frontier on this task: Claude Opus 4.7 at 8.57 / 10. Quality bar at 95%: 8.15.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
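The qualification rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: `Row` and `split_by_bar` are hypothetical names, the figures are copied from the cost breakdown table on this page, the comparison against the bar is assumed to be inclusive (`>=`), and the MEDIUM+ tier check is left out because model tiers are not listed here.

```python
from dataclasses import dataclass

QUALITY_BAR = 8.15  # "Quality bar at 95%" as stated on this page


@dataclass(frozen=True)
class Row:
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower bound of the confidence interval
    cost: float     # blended cost per call, USD


# Figures copied from the cost breakdown table.
ROWS = [
    Row("Claude Opus 4.7", 8.57, 8.42, 0.027282),
    Row("DeepSeek V4 Pro", 8.43, 8.29, 0.010078),
    Row("Kimi K2.6", 8.27, 8.13, 0.005452),
    Row("Qwen 3.6 Plus", 8.34, 8.23, 0.004102),
    Row("DeepSeek V4 Flash", 8.35, 8.21, 0.000811),
    Row("Qwen 3.5 Flash", 8.30, 8.17, 0.000515),
]


def split_by_bar(rows, bar):
    """Return (qualifiers, point_only), each sorted cheapest first.

    qualifiers: CI low clears the bar (assumed inclusive).
    point_only: point estimate clears the bar, CI low does not.
    """
    by_cost = sorted(rows, key=lambda r: r.cost)
    qualifiers = [r for r in by_cost if r.ci_low >= bar]
    point_only = [r for r in by_cost if r.ci_low < bar <= r.quality]
    return qualifiers, point_only


qualifiers, point_only = split_by_bar(ROWS, QUALITY_BAR)
```

With these figures, Kimi K2.6 is the only row whose CI low (8.13) falls under the bar while its point estimate (8.27) clears it; the other five rows clear the bar on CI low.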
Cost breakdown
| Model | Provider | Quality (/ 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 8.30 | [8.17, 8.44] | n=100 · ranked | $0.000515 | 98% cheaper | sync |
| DeepSeek V4 Flash | DeepSeek | 8.35 | [8.21, 8.50] | n=100 · ranked | $0.000811 | 97% cheaper | sync |
| Qwen 3.6 Plus | Alibaba Cloud (DashScope) | 8.34 | [8.23, 8.44] | n=100 · ranked | $0.004102 | 85% cheaper | sync |
| Kimi K2.6 | Moonshot AI | 8.27 | [8.13, 8.41] | n=100 · ranked | $0.005452 | 80% cheaper | batch |
| DeepSeek V4 Pro | DeepSeek | 8.43 | [8.29, 8.58] | n=100 · ranked | $0.010078 | 63% cheaper | sync |
| Claude Opus 4.7 (best) | Anthropic | 8.57 | [8.42, 8.73] | n=100 · ranked | $0.027282 | (anchor) | batch |
Typical call shape for this task: 2378 input tokens → 1707 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
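As a rough illustration of the blended-cost formula, here is a short Python sketch. The function names are hypothetical, the prices in the example are made up (this page does not list per-token prices), and prices are assumed to be in USD per token. The savings formula is inferred from the table figures rather than stated by the source, and any batch-mode discount is not modelled.

```python
def blended_cost(in_tokens, out_tokens, in_price, out_price):
    """Blended cost per call in USD, rounded to 6 decimals.

    in_price and out_price are assumed to be USD per token.
    """
    return round(in_tokens * in_price + out_tokens * out_price, 6)


def savings_vs_anchor(cost, anchor_cost):
    """Whole-percent savings relative to the anchor (best) model.

    Inferred formula: 1 - cost / anchor_cost, expressed as a percentage.
    """
    return round((1 - cost / anchor_cost) * 100)


# Hypothetical prices, for illustration only: $0.10 and $0.40 per million tokens.
IN_PRICE = 0.10 / 1_000_000
OUT_PRICE = 0.40 / 1_000_000

# Typical call shape from the text: 2378 input tokens, 1707 output tokens.
example_cost = blended_cost(2378, 1707, IN_PRICE, OUT_PRICE)

# Savings for the cheapest row, using the table's blended costs.
cheapest_savings = savings_vs_anchor(0.000515, 0.027282)
```

Applied to the blended costs in the table, the inferred savings formula reproduces the listed percentages, for example 98% for Qwen 3.5 Flash against the Claude Opus 4.7 anchor.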