Best LLMs for synthesis_of_titles_for_publication autogenerated — DTP Benchmark
Models
Frontier on this task: Kimi K2.6 at 8.68 / 10. Quality bar at 95% of the frontier score: 8.25.
Chart legend: each bar shows the point-estimate floor (CI low) and the upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
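The qualification rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: it assumes the quality bar is 95% of the frontier score rounded to two decimals (0.95 × 8.68 ≈ 8.25) and that comparisons use that rounded bar. The function name `classify` and the status labels are hypothetical; the scores are copied from the Cost breakdown table.

```python
FRONTIER_SCORE = 8.68  # Kimi K2.6 point estimate, as reported on the page
BAR_FRACTION = 0.95    # assumption: "quality bar at 95%" means 95% of the frontier

# Rounded to two decimals to match the 8.25 shown on the page.
quality_bar = round(FRONTIER_SCORE * BAR_FRACTION, 2)


def classify(point_estimate: float, ci_low: float, bar: float) -> str:
    """Return a hypothetical status label for one model."""
    if ci_low >= bar:
        return "qualifies"            # even the CI floor clears the bar
    if point_estimate >= bar:
        return "point estimate only"  # clears the bar, but the CI floor does not
    return "below bar"


# (point estimate, CI low) pairs taken from the Cost breakdown table.
models = {
    "Kimi K2.6": (8.68, 8.55),
    "Gemini 3.1 Pro Preview": (8.25, 8.13),
}

statuses = {
    name: classify(point, ci_low, quality_bar)
    for name, (point, ci_low) in models.items()
}

for name, status in statuses.items():
    print(f"{name}: {status}")
```

Under these assumptions Kimi K2.6 qualifies outright, while Gemini 3.1 Pro Preview clears the bar only on its point estimate. The MEDIUM+ tier check mentioned in the legend is not modelled here.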
Cost breakdown
| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Kimi K2.6 (Moonshot AI), best | 8.68 / 10, CI [8.55, 8.81] | n=90 · ranked | $0.005978 | (anchor) | batch |
| Gemini 3.1 Pro Preview (Gemini) | 8.25 / 10, CI [8.13, 8.38] | n=69 · ranked | $0.013731 | — | batch |
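Typical call shape for this task: 2859 input tokens → 1812 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = in × in_price + out × out_price, where in and out are the token counts and each price is per token; the result is rounded to 6 decimals.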
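As a worked example of the blended-cost formula, the sketch below plugs in the typical call shape. The page does not list per-token prices, so the two prices here are hypothetical placeholders chosen only to make the arithmetic concrete; they do not correspond to any model in the table. The function name `blended_cost` is likewise an assumption.

```python
IN_TOKENS = 2859   # typical input tokens per call, from the page
OUT_TOKENS = 1812  # typical output tokens per call, from the page


def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost of one call in dollars, with per-token prices, rounded to 6 decimals."""
    return round(in_tokens * in_price + out_tokens * out_price, 6)


# HYPOTHETICAL prices, not taken from the page:
# $1.00 per million input tokens, $2.00 per million output tokens.
HYPOTHETICAL_IN_PRICE = 1.00 / 1_000_000
HYPOTHETICAL_OUT_PRICE = 2.00 / 1_000_000

example_cost = blended_cost(
    IN_TOKENS, OUT_TOKENS, HYPOTHETICAL_IN_PRICE, HYPOTHETICAL_OUT_PRICE
)
print(f"${example_cost:.6f}")
```

Because output tokens are typically priced higher than input tokens, the 1812 output tokens can dominate the total even though the input is longer. With the placeholder prices above, the output side contributes more than half of the cost.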