Best LLMs for Claim-Referenced Analyst Writing (pooled) — DTP Benchmark
Pooled TT for analyst-prose synthesis tasks that must preserve [N] workflow-global claim references: cluster_claim_synthesis, chapter_consolidation, and topic_report_generation. At call sites, it is wired to the validate_citations_preserved validator.
Models
Frontier on this task: Claude Sonnet 4.6 at 8.98 / 10. Quality bar at 95% of the frontier score: 8.53.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
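The bar and the greying rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code. It assumes that "95%" means 95% of the frontier point estimate (0.95 × 8.98 ≈ 8.53, which matches the stated bar), that comparisons are inclusive and made against the rounded bar, and it ignores the MEDIUM+ confidence condition. The names `quality_bar` and `classify` and the three labels are hypothetical.

```python
FRONTIER_SCORE = 8.98  # Claude Sonnet 4.6 point estimate, as stated on the page
BAR_FRACTION = 0.95    # assumption: "95%" is a fraction of the frontier score


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Return the quality bar, rounded to two decimals as displayed."""
    return round(frontier * fraction, 2)


def classify(point: float, ci_low: float, bar: float) -> str:
    """Label a model relative to the bar (labels are hypothetical).

    "qualifies": the CI low itself clears the bar.
    "greyed":    the point estimate clears the bar but the CI low does not.
    "below":     even the point estimate misses the bar.
    """
    if ci_low >= bar:  # assumption: the comparison is inclusive
        return "qualifies"
    if point >= bar:
        return "greyed"
    return "below"


bar = quality_bar(FRONTIER_SCORE)
print(bar)
print(classify(8.98, 8.69, bar))  # Claude Sonnet 4.6: point 8.98, CI low 8.69
print(classify(8.77, 8.48, bar))  # Kimi K2.6: point 8.77, CI low 8.48
```

A model whose point estimate sits exactly on the rounded bar is sensitive to the inclusive-comparison assumption, so its label here should not be read as the benchmark's verdict.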
Cost breakdown
| Model (vendor) | Quality (point estimate, CI) | Sample size · confidence | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Kimi K2.6 (Moonshot AI) | 8.77 / 10, CI [8.48, 9.06] | n=100 · high | $0.054578 | 67% cheaper | batch |
| DeepSeek V4 Pro (DeepSeek) | 8.69 / 10, CI [8.36, 9.03] | n=63 · high | $0.093518 | 44% cheaper | sync |
| Claude Sonnet 4.6 (Anthropic), best | 8.98 / 10, CI [8.69, 9.26] | n=84 · high | $0.166128 | (anchor) | batch |
| Claude Opus 4.7 (Anthropic) | 8.82 / 10, CI [8.57, 9.07] | n=100 · high | $0.276880 | — | batch |
| GPT-5.5 (OpenAI) | 8.53 / 10, CI [8.29, 8.77] | n=90 · high | $0.324385 | — | batch |
Typical call shape for this task: 15,742 input tokens → 19,002 output tokens, EMA-tracked from production traffic. Blended cost = in × in_price + out × out_price, where in and out are the token counts above and the prices are per token; the result is rounded to 6 decimal places.
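The blended-cost formula and the "Savings vs best" column can be reproduced with a short sketch. The page does not list per-token prices, so the prices below are assumptions: $1.50 and $7.50 per million input and output tokens are illustrative values that happen to reproduce the anchor row's $0.166128 for the stated call shape. Quoting prices per million tokens and the function names `blended_cost` and `savings_vs_best` are also assumptions.

```python
IN_TOKENS = 15_742   # typical input tokens per call, from the page
OUT_TOKENS = 19_002  # typical output tokens per call, from the page


def blended_cost(in_tokens: int, out_tokens: int,
                 in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Cost of one call in dollars, rounded to 6 decimals.

    Prices are given per million tokens (an assumption of this sketch).
    """
    total = in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok
    return round(total / 1_000_000, 6)


def savings_vs_best(cost: float, best_cost: float) -> int:
    """Whole-percent saving of `cost` relative to the anchor model's cost."""
    return round(100 * (1 - cost / best_cost))


# Assumed prices (not from the page): $1.50 in, $7.50 out per million tokens.
anchor = blended_cost(IN_TOKENS, OUT_TOKENS, 1.50, 7.50)
print(anchor)

# Savings computed from the table's own blended costs.
print(savings_vs_best(0.054578, 0.166128))  # Kimi K2.6 row
print(savings_vs_best(0.093518, 0.166128))  # DeepSeek V4 Pro row
```

Because output tokens outnumber input tokens in this call shape, the output price dominates the blended figure under these assumed prices.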