Best LLMs for claim_refinement autogenerated — DTP Benchmark
Models
Frontier on this task: Qwen 3.6 Plus at 8.08 / 10. Quality bar (95% of the frontier score): 7.68.
Legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
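The bar and the greying rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's own code: it assumes the bar is the frontier point estimate multiplied by 0.95 and rounded to two decimals (which reproduces 7.68 from 8.08), and that "clears the bar" means greater than or equal. The names `Result`, `quality_bar`, and `classify` are hypothetical, and the MEDIUM+ tier condition is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    score: float    # point estimate, out of 10
    ci_low: float   # lower end of the confidence interval
    ci_high: float  # upper end of the confidence interval


def quality_bar(frontier_score: float, fraction: float = 0.95) -> float:
    # Assumption: bar = fraction of the frontier point estimate, 2 decimals.
    return round(frontier_score * fraction, 2)


def classify(r: Result, bar: float) -> str:
    # Assumption: ">=" counts as clearing the bar.
    if r.ci_low >= bar:
        return "qualifies"   # even the CI floor clears the bar
    if r.score >= bar:
        return "greyed"      # point estimate clears, CI floor does not
    return "below bar"


bar = quality_bar(8.08)
qwen = Result("Qwen 3.6 Plus", 8.08, 7.70, 8.47)
gemini = Result("Gemini 3 Flash Preview", 7.73, 7.51, 7.95)
print(bar, classify(qwen, bar), classify(gemini, bar))
```

With the figures reported on this page, the CI low of Qwen 3.6 Plus (7.70) sits above the bar, while Gemini 3 Flash Preview clears it only on its point estimate (7.73), with a CI low of 7.51.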
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Gemini 3 Flash Preview | Gemini | 7.73 / 10, CI [7.51, 7.95] | n=100 · ranked | $0.000862 | 23% cheaper | batch |
| Qwen 3.6 Plus (best) | Alibaba Cloud (DashScope) | 8.08 / 10, CI [7.70, 8.47] | n=100 · high | $0.001121 | (anchor) | sync |
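The "Savings vs best" column is consistent with a simple relative difference against the anchor row's blended cost. A minimal sketch, assuming that is how the figure is derived (the function name `savings_vs_best` is hypothetical):

```python
def savings_vs_best(cost: float, best_cost: float) -> float:
    """Percent by which `cost` undercuts `best_cost` (positive means cheaper)."""
    return (1 - cost / best_cost) * 100


# Blended costs per call taken from the table above.
gemini_cost = 0.000862
qwen_cost = 0.001121  # anchor (best-quality model)

print(f"{savings_vs_best(gemini_cost, qwen_cost):.0f}% cheaper")  # 23% cheaper
```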
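Typical call shape for this task: 2062 input tokens → 231 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals, where in and out are the input and output token counts and in_price and out_price are the corresponding per-token prices.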
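The blended-cost formula can be sketched as follows. The page does not list per-token prices, so the prices below are placeholders chosen only to make the arithmetic easy to follow; they are not any model's real rates. Quoting prices per million tokens is also an assumption of this sketch, as is the function name `blended_cost`.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Cost of one call in dollars, rounded to 6 decimals.

    Prices are given per million tokens (an assumption of this sketch)
    and converted to per-token prices before applying
    in x in_price + out x out_price.
    """
    in_price = in_price_per_mtok / 1_000_000
    out_price = out_price_per_mtok / 1_000_000
    return round(in_tokens * in_price + out_tokens * out_price, 6)


# Typical call shape from this page; the prices are HYPOTHETICAL placeholders.
cost = blended_cost(2062, 231, in_price_per_mtok=1.0, out_price_per_mtok=2.0)
print(f"${cost:.6f}")
```

Because the typical call is dominated by input tokens (2062 in, 231 out), the input price has the larger effect on the blended figure unless the output price is roughly nine times higher.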