Best LLMs for Relevance Scoring (X Post) — DTP Benchmark
Scores batched X-com posts against synthesis capability (x_post_relevance stage). This task was split from the pooled relevance_scoring task on 2026-05-17. It uses GENERIC_RELEVANCE_SCORE_{SYSTEM,USER}_PROMPT with batched input (≥20k tokens).
Models
Frontier on this task: Qwen 3.5 Flash at 7.30 / 10. Quality bar at 95%: 6.94.
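The quality bar above is consistent with taking 95% of the frontier score. A minimal sketch of that arithmetic follows, assuming half-up rounding to two decimals. The function name `quality_bar` and the rounding mode are illustrative assumptions, not something the benchmark states.

```python
from decimal import Decimal, ROUND_HALF_UP


def quality_bar(frontier_score: str, fraction: str = "0.95") -> Decimal:
    """Return fraction * frontier_score, rounded half-up to two decimals.

    Decimal is used instead of float so that 6.935 rounds predictably.
    The half-up rounding mode is an assumption.
    """
    raw = Decimal(frontier_score) * Decimal(fraction)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


bar = quality_bar("7.30")
print(bar)  # 6.94
```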
Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
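The qualification and ordering rule in the legend can be sketched as below. This is an illustration under stated assumptions: "clears the bar" is read as greater than or equal to, the MEDIUM+ sample tier is not modelled, and the two `hypothetical-model-*` rows are made up purely to exercise the branches. Only the Qwen 3.5 Flash row comes from the table.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    score: float         # point estimate, out of 10
    ci_low: float        # lower bound of the confidence interval
    ci_high: float       # upper bound of the confidence interval
    blended_cost: float  # USD per call


def classify(result: ModelResult, bar: float) -> str:
    """Label a row as 'qualifies', 'greyed', or 'below_bar'."""
    if result.ci_low >= bar:
        return "qualifies"   # even the CI low clears the bar
    if result.score >= bar:
        return "greyed"      # point estimate clears it, CI low does not
    return "below_bar"


def rank_qualifiers(results: list, bar: float) -> list:
    """Return qualifying rows, cheapest blended cost first."""
    qualifiers = [r for r in results if classify(r, bar) == "qualifies"]
    return sorted(qualifiers, key=lambda r: r.blended_cost)


BAR = 6.94
results = [
    ModelResult("Qwen 3.5 Flash", 7.30, 7.00, 7.60, 0.002861),        # from the table
    ModelResult("hypothetical-model-a", 7.05, 6.80, 7.30, 0.001500),  # made up
    ModelResult("hypothetical-model-b", 7.10, 6.95, 7.25, 0.004000),  # made up
]

for r in results:
    print(r.name, classify(r, BAR))
print([r.name for r in rank_qualifiers(results, BAR)])
```

Gating on the CI low rather than the point estimate keeps a cheap model with a wide interval from being selected ahead of a model whose score is better established.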
Cost breakdown
| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Qwen 3.5 Flash (best) · Alibaba Cloud (DashScope) | 7.30 / 10, CI [7.00, 7.60] | n=100 · high | $0.002861 | (anchor) | sync |
Typical call shape for this task: 87232 input tokens → 940 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
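The blended-cost formula can be written as a small function. This is a sketch only: the page does not list per-token prices, so the prices below are hypothetical placeholders, and quoting them per million tokens is also an assumption. The result is therefore not expected to match the table.

```python
def blended_cost(
    in_tokens: int,
    out_tokens: int,
    in_price_per_mtok: float,
    out_price_per_mtok: float,
) -> float:
    """Blended cost per call in USD, rounded to 6 decimals.

    Implements in * in_price + out * out_price, with prices quoted
    per million tokens (an assumption about the price unit).
    """
    total = in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok
    return round(total / 1_000_000, 6)


# Call shape from the text; both prices are HYPOTHETICAL placeholders.
cost = blended_cost(87232, 940, in_price_per_mtok=0.05, out_price_per_mtok=0.40)
print(f"${cost:.6f}")
```

With a call shape this input-heavy, the input price dominates the blended figure, so output pricing has little effect on where a model lands in the ordering.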
Evaluation rubric
The output under evaluation is a batch of {"id", "relevance_score"}
entries scoring how well X-com (Twitter) posts match a given scoring
context (a topic name, a topic description, and a topic report summary).
Judge ONLY the correctness of the relevance assignments that were
delivered. For each entry present in the output, the question is: is
this relevance_score a defensible reflection of how well that post's
content matches the scoring context?
Apply these correctness checks:
- Clearly on-topic posts should carry high scores; clearly off-topic
posts should carry low scores. Wrong-direction assignments are the
primary quality failure.
- Scores should discriminate between posts that visibly differ in
topical fit. A run of identical scores across plainly-different posts
is degenerate output, not calibrated judgement.
- Score magnitudes should be internally consistent within the batch —
posts the model rated similarly should be similarly relevant on
inspection.
Do NOT consider response completeness. Whether the output covers every
input post is OUT OF SCOPE for this evaluation. The production pipeline
reschedules any unscored items to a different model, so partial output
is a routing outcome — not a quality defect. Grade only the entries
that ARE present, each on its own merit; treat missing entries as if
they never belonged to this batch.
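The completeness exclusion above amounts to iterating over the output rather than the input when building the set of items to grade. A minimal sketch follows, assuming posts are keyed by id. The function name `entries_to_grade` and the choice to skip output ids that do not appear in the input are illustrative assumptions.

```python
def entries_to_grade(
    input_posts: dict[str, str],
    output: list[dict],
) -> list[tuple[str, str, float]]:
    """Pair each delivered score with its post text.

    Posts missing from the output are never visited, so they cannot
    lower the grade. Output ids unknown to the input are skipped
    (an assumption; the rubric does not cover that case).
    """
    graded = []
    for entry in output:
        post_id = entry["id"]
        if post_id in input_posts:
            graded.append((post_id, input_posts[post_id], entry["relevance_score"]))
    return graded


# Hypothetical batch: three posts in, two scored.
posts = {"p1": "post about the topic", "p2": "unrelated post", "p3": "another post"}
output = [{"id": "p1", "relevance_score": 0.9}, {"id": "p2", "relevance_score": 0.1}]
print(entries_to_grade(posts, output))
```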
Also ignore cosmetic details that don't change the assignment's
correctness: decimal precision, JSON field ordering, whitespace,
phrasing of any optional reasoning field.
Penalise:
- Assignments that are demonstrably wrong (high score on a clearly
off-topic post, low score on a clearly on-topic post).
- Degenerate batches: all-zero, all-one, or all-identical scores when
the input posts visibly differ in topical fit.
- Outputs where the score-to-content relationship is random or absent.
Reward:
- Correct directionality and meaningful discrimination across the batch,
even when the batch is small.
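Two of the penalised patterns can be roughly approximated mechanically. The sketch below is not the judge's implementation. It assumes scores on a 0 to 1 scale (suggested by "all-zero, all-one"), hypothetical thresholds of 0.7 and 0.3, and a hypothetical reference dict of on-topic judgements. The degenerate check cannot tell whether the posts visibly differ in topical fit, so a positive result only flags a batch for inspection.

```python
def is_degenerate(scores: list, tol: float = 1e-9) -> bool:
    """True when a batch of two or more scores is all (nearly) identical."""
    return len(scores) >= 2 and max(scores) - min(scores) <= tol


def wrong_direction(
    scored: dict,
    reference: dict,
    high: float = 0.7,
    low: float = 0.3,
) -> list:
    """Return ids scored high on off-topic posts or low on on-topic posts.

    `reference` maps id -> True (clearly on-topic) or False (clearly
    off-topic). Ids without a reference judgement are ignored.
    """
    errors = []
    for post_id, score in scored.items():
        on_topic = reference.get(post_id)
        if on_topic is None:
            continue
        if on_topic and score <= low:
            errors.append(post_id)
        elif not on_topic and score >= high:
            errors.append(post_id)
    return errors


# Hypothetical batch and reference judgements.
scored = {"a": 0.9, "b": 0.1, "c": 0.85}
reference = {"a": True, "b": True, "c": False}
print(is_degenerate(list(scored.values())))  # False
print(wrong_direction(scored, reference))    # ['b', 'c']
```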