Best LLMs for Relevance Scoring (POST) — DTP Benchmark
Scores RetrievedContent against synthesis capability description (stage 40 relevance_analysis). Split from pooled relevance_scoring on 2026-05-17 to remove inter-family σ inflation. POST_RELEVANCE_{SYSTEM,USER}_PROMPT.
Models
Frontier on this task: GPT-5.5 at 5.92 / 10. Quality bar at 95%: 5.62.
point-estimate floor (CI low) · upper CI (less certain) · Bars sorted by blended cost; cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
Cost breakdown
| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Qwen 3.6 Plus Alibaba Cloud (DashScope) | 5.88 / 10 CI [5.49, 6.26] | n=100 · ranked | $0.008456 | 87% cheaper | sync |
| GPT-5.5 best OpenAI | 5.92 / 10 CI [5.56, 6.28] | n=100 · ranked | $0.065050 | (anchor) | batch |
Typical call shape for this task: 18712 input tokens → 1218 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.