Best LLMs for Relevance Scoring (Topic Report) — DTP Benchmark
This task scores TOPIC_REPORT PartialSyntheses against the analysis template (stage 132) and the report chapters (stage 134). It was split from the pooled relevance_scoring task on 2026-05-17. Prompts: GENERIC_RELEVANCE_SCORE_{SYSTEM,USER}_PROMPT. Input: a single item (<20k tokens).
Models
Frontier on this task: Qwen 3.5 Flash at 8.41 / 10. Quality bar at 95% of the frontier score: 7.99 / 10.
Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
Cost breakdown
| Model | Provider | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.5 Flash (best) | Alibaba Cloud (DashScope) | 8.41 / 10 CI [8.23, 8.60] | n=100 · ranked | $0.001397 | (anchor) | sync |
| GPT-5.4 nano | OpenAI | 8.09 / 10 CI [7.79, 8.38] | n=100 · high | $0.004535 | — | batch |
| DeepSeek V4 Flash | DeepSeek | 8.35 / 10 CI [8.16, 8.54] | n=100 · high | $0.006051 | — | sync |
| Gemini 3 Flash Preview | Gemini | 8.18 / 10 CI [7.75, 8.60] | n=93 · medium | $0.011306 | — | batch |
| MiniMax M2.5 | MiniMax | 8.08 / 10 CI [7.76, 8.40] | n=100 · high | $0.013267 | — | sync |
| Claude Sonnet 4.6 | Anthropic | 8.36 / 10 CI [8.12, 8.60] | n=100 · high | $0.067086 | — | batch |
| Claude Opus 4.7 | Anthropic | 8.08 / 10 CI [7.78, 8.39] | n=59 · medium | $0.111810 | — | batch |
| GPT-5.5 | OpenAI | 8.29 / 10 CI [8.02, 8.56] | n=100 · high | $0.113062 | — | batch |
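The quality bar and the greyed-row rule can be reproduced from the table values. The sketch below is illustrative only: it assumes "clears the bar" means greater than or equal to the bar, assumes the bar is the frontier score times 0.95 rounded to two decimals, and ignores the MEDIUM+ condition. The names `quality_bar` and `classify` are hypothetical, not taken from the benchmark code.

```python
# Illustrative sketch, not the benchmark's own implementation.
FRONTIER_SCORE = 8.41   # Qwen 3.5 Flash point estimate, from the report
BAR_FRACTION = 0.95     # "Quality bar at 95%"

# (model, point estimate, CI low, blended cost per call in USD), copied from the table.
ROWS = [
    ("Qwen 3.5 Flash", 8.41, 8.23, 0.001397),
    ("GPT-5.4 nano", 8.09, 7.79, 0.004535),
    ("DeepSeek V4 Flash", 8.35, 8.16, 0.006051),
    ("Gemini 3 Flash Preview", 8.18, 7.75, 0.011306),
    ("MiniMax M2.5", 8.08, 7.76, 0.013267),
    ("Claude Sonnet 4.6", 8.36, 8.12, 0.067086),
    ("Claude Opus 4.7", 8.08, 7.78, 0.111810),
    ("GPT-5.5", 8.29, 8.02, 0.113062),
]


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Bar as a fraction of the frontier score (assumed rounding: 2 decimals)."""
    return round(frontier * fraction, 2)


def classify(point: float, ci_low: float, bar: float) -> str:
    """Assumed rule: qualify on CI low; grey out if only the point estimate clears."""
    if ci_low >= bar:
        return "qualifies"
    if point >= bar:
        return "greyed"
    return "below bar"


bar = quality_bar(FRONTIER_SCORE)
labels = {name: classify(point, ci_low, bar) for name, point, ci_low, _ in ROWS}
qualifiers = [row for row in ROWS if labels[row[0]] == "qualifies"]
cheapest = min(qualifiers, key=lambda row: row[3])

for name, label in labels.items():
    print(f"{name}: {label}")
print(f"bar = {bar}, cheapest qualifier = {cheapest[0]}")
```

Under these assumptions, four models qualify on CI low and the other four fall into the greyed group.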
Typical call shape for this task: 42219 input tokens → 501 output tokens, EMA-tracked from production traffic. Blended cost = in × in_price + out × out_price, where in and out are the input and output token counts, rounded to 6 decimals.
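The blended-cost formula can be sketched as follows. The per-million-token prices here are hypothetical placeholders, not any listed model's actual prices, and the assumption that prices are quoted per million tokens is also mine. Only the token counts and the 6-decimal rounding come from the report.

```python
# Illustrative sketch of: blended cost = in * in_price + out * out_price.
TOKENS_PER_MILLION = 1_000_000

# Typical call shape from the report.
INPUT_TOKENS = 42219
OUTPUT_TOKENS = 501


def blended_cost(
    in_tokens: int,
    out_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Cost of one call in USD, rounded to 6 decimals as in the report."""
    cost = (
        in_tokens * in_price_per_mtok / TOKENS_PER_MILLION
        + out_tokens * out_price_per_mtok / TOKENS_PER_MILLION
    )
    return round(cost, 6)


# Hypothetical prices (USD per million tokens), chosen only for illustration.
example_cost = blended_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.10, 0.40)
print(f"${example_cost:.6f}")
```

With this call shape, input tokens outnumber output tokens by roughly 84 to 1, so the input price dominates the blended cost.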