Best LLMs for structured_output_extraction (autogenerated) - DTP Benchmark
Models
Frontier on this task: Qwen 3.6 Plus at 9.84 / 10. Quality bar at 95% of the frontier score: 9.35.
Chart legend: point-estimate floor (CI low) · upper CI (less certain).
Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
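The qualification rule described above can be sketched roughly as follows. This is an illustration, not the benchmark's actual code: the names `Score`, `quality_bar`, and `classify` are hypothetical, the bar is assumed to be 95% of the frontier point estimate rounded to 2 decimals, the comparison is assumed to be `>=`, and the MEDIUM+ sample-tier filter is left out. The four rows are copied from the cost table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    point: float   # point-estimate quality, out of 10
    ci_low: float  # lower bound of the confidence interval
    cost: float    # blended cost per call, USD


def quality_bar(frontier: float, fraction: float = 0.95) -> float:
    # Assumption: the bar is a fixed fraction of the frontier score,
    # rounded to 2 decimals as displayed on the page.
    return round(frontier * fraction, 2)


def classify(s: Score, bar: float) -> str:
    # Assumption: a model qualifies when even its CI low clears the bar.
    if s.ci_low >= bar:
        return "qualifies"
    # Point estimate clears the bar but CI low does not: shown greyed.
    if s.point >= bar:
        return "greyed"
    return "below bar"


rows = [
    Score("Qwen 3.5 Flash", 9.78, 9.69, 0.002581),
    Score("MiniMax M2.5", 9.71, 9.34, 0.012287),
    Score("Qwen 3.6 Plus", 9.84, 9.69, 0.019588),
    Score("GPT-5.5", 9.42, 9.15, 0.301350),
]

bar = quality_bar(max(r.point for r in rows))

# Cheapest qualifier first, matching the chart's sort order.
qualifiers = sorted(
    (r for r in rows if classify(r, bar) == "qualifies"),
    key=lambda r: r.cost,
)

for r in rows:
    print(f"{r.model}: {classify(r, bar)}")
```

Under these assumptions MiniMax M2.5 lands in the greyed group: its point estimate of 9.71 clears the bar, but its CI low of 9.34 falls just under 9.35.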
Cost breakdown
| Model | Provider | Quality (out of 10) | CI | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|---|
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 9.78 | [9.69, 9.87] | n=100 · high | $0.002581 | 87% cheaper | sync |
| DeepSeek V4 Flash | DeepSeek | 9.75 | [9.65, 9.85] | n=88 · ranked | $0.003030 | 85% cheaper | sync |
| MiniMax M2.5 | MiniMax | 9.71 | [9.34, 10.00] | n=85 · medium | $0.012287 | 37% cheaper | sync |
| GPT-5.4 nano | OpenAI | 9.76 | [9.54, 9.99] | n=88 · high | $0.012537 | 36% cheaper | batch |
| Qwen 3.6 Plus (best) | Alibaba Cloud (DashScope) | 9.84 | [9.69, 9.99] | n=100 · ranked | $0.019588 | (anchor) | sync |
| DeepSeek V4 Pro | DeepSeek | 9.43 | [9.22, 9.64] | n=89 · high | $0.037664 | - | sync |
| Kimi K2.6 | Moonshot AI | 9.54 | [9.32, 9.74] | n=99 · high | $0.040841 | - | batch |
| Gemini 3.1 Pro Preview | Gemini | 9.63 | [9.41, 9.85] | n=90 · high | $0.120540 | - | batch |
| GPT-5.5 | OpenAI | 9.42 | [9.15, 9.70] | n=81 · high | $0.301350 | - | batch |
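The "Savings vs best" column appears to be the relative cost reduction against the anchor model (Qwen 3.6 Plus at $0.019588 per call), rounded to a whole percent. A minimal sketch under that assumption follows. The function name `savings_pct` is hypothetical, and returning `None` for models that cost as much as or more than the anchor is an illustrative choice.

```python
from typing import Optional

BEST_COST = 0.019588  # blended cost per call of the anchor model, from the table


def savings_pct(cost: float, best_cost: float = BEST_COST) -> Optional[int]:
    """Whole-percent saving relative to the anchor, or None if not cheaper."""
    if cost >= best_cost:
        return None
    return round((1 - cost / best_cost) * 100)


# Costs copied from the table above.
for name, cost in [
    ("Qwen 3.5 Flash", 0.002581),
    ("DeepSeek V4 Flash", 0.003030),
    ("MiniMax M2.5", 0.012287),
    ("GPT-5.4 nano", 0.012537),
    ("DeepSeek V4 Pro", 0.037664),
]:
    print(name, savings_pct(cost))
```

With the table's costs this reproduces the listed 87%, 85%, 37%, and 36% figures, which suggests the assumed formula matches how the column was derived.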
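Typical call shape for this task: 2334 input tokens → 9656 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost per call = in × in_price + out × out_price, where in and out are the token counts and both prices are per token; the result is rounded to 6 decimals.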
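The blended-cost formula can be worked through with the typical call shape given above. The page does not list per-model token prices, so the two prices below are hypothetical placeholders. Quoting them per million tokens and converting to per-token is also an assumption about how the prices are stored.

```python
IN_TOKENS = 2334   # typical input tokens per call, from the page
OUT_TOKENS = 9656  # typical output tokens per call, from the page


def blended_cost(
    in_price_per_m: float,
    out_price_per_m: float,
    in_tokens: int = IN_TOKENS,
    out_tokens: int = OUT_TOKENS,
) -> float:
    """Blended cost per call in USD, rounded to 6 decimals.

    Prices are given in USD per million tokens (assumption) and the sum
    is converted to a per-call figure before rounding.
    """
    total = in_tokens * in_price_per_m + out_tokens * out_price_per_m
    return round(total / 1_000_000, 6)


# Hypothetical prices, NOT taken from the benchmark:
# $0.10 per million input tokens, $0.40 per million output tokens.
example = blended_cost(0.10, 0.40)
print(example)  # 0.004096
```

Because this call shape produces roughly four times as many output tokens as input tokens, the output price dominates the blended figure. In the hypothetical example the output side contributes about 94% of the cost.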