Best LLMs for Topic Report Relevance Scoring

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 17417→749 tokens

Models

Frontier on this task: Gemini 3.5 Flash at 9.30 / 10. Quality bar at 90%: 8.37.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
Qwen 3.6 Flash	8.42 / 10	8.00	$6.68	best value
Gemini 3.5 Flash	9.30 / 10	8.98	$8.77	1.3x more expensive
Qwen 3.7 Plus	8.99 / 10	8.65	$12.62	1.9x more expensive
GPT-5.6 Terra	9.01 / 10	8.54	$23.76	3.6x more expensive
GPT-5.5	8.45 / 10	8.19	$28.70	4.3x more expensive
Grok 4.5	8.65 / 10	8.46	$32.57	4.9x more expensive
Meta Muse Spark 1.1	8.95 / 10	8.51	$33.81	5.1x more expensive
GPT-5.6 Sol	9.27 / 10	8.89	$49.53	7.4x more expensive
Claude Sonnet 4.6	8.16 / 10	7.88	$15.66	2.3x more expensive
Gemini 3.1 Pro Preview	7.65 / 10	7.18	$7.68	1.1x more expensive
Qwen 3.5 Flash	8.14 / 10	7.82	$3.85	42% cheaper
DeepSeek V4 Flash	8.31 / 10	8.00	$1.31	80% cheaper
GPT-5.4 Nano	7.84 / 10	7.43	$1.28	81% cheaper
Claude Haiku 4.5	7.79 / 10	7.44	$4.25	36% cheaper
Kimi K2.6	8.17 / 10	7.84	$17.18	2.6x more expensive
DeepSeek V4 Pro	8.15 / 10	7.76	$4.46	33% cheaper
Gemini 3.1 Flash Lite	7.99 / 10	7.56	$13.24	2x more expensive
Qwen 3.6 Plus	7.82 / 10	7.43	$6.07	9% cheaper
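
The "vs best value" column can be reproduced from the table itself. The sketch below assumes the quality bar is 90% of the frontier score and that the best-value reference is the cheapest model whose point estimate clears that bar. Both rules are inferred from the numbers above rather than documented, and the function names are illustrative only.

```python
# Sketch: reproduce the quality bar and the "vs best value" labels.
# Assumptions (inferred from the table, not documented): the bar is 90% of the
# frontier score, and the reference is the cheapest model whose point estimate
# clears the bar.

ROWS = [  # (model, quality score, cost per 1k runs in $), copied from the table
    ("Qwen 3.6 Flash", 8.42, 6.68),
    ("Gemini 3.5 Flash", 9.30, 8.77),
    ("GPT-5.6 Sol", 9.27, 49.53),
    ("Gemini 3.1 Flash Lite", 7.99, 13.24),
    ("Qwen 3.5 Flash", 8.14, 3.85),
    ("DeepSeek V4 Flash", 8.31, 1.31),
]


def quality_bar(rows, fraction=0.90):
    """Quality bar as a fraction of the best (frontier) score."""
    frontier = max(score for _, score, _ in rows)
    return round(fraction * frontier, 2)


def best_value(rows, bar):
    """Cheapest row whose point estimate clears the bar."""
    eligible = [row for row in rows if row[1] >= bar]
    return min(eligible, key=lambda row: row[2])


def cost_label(cost, ref_cost):
    """Format a cost relative to the best-value reference cost."""
    if cost == ref_cost:
        return "best value"
    ratio = cost / ref_cost
    if ratio > 1:
        return f"{round(ratio, 1):g}x more expensive"
    return f"{round((1 - ratio) * 100)}% cheaper"


bar = quality_bar(ROWS)
ref = best_value(ROWS, bar)
labels = {name: cost_label(cost, ref[2]) for name, _, cost in ROWS}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Under these assumptions the cheaper models in the lower half of the table (for example DeepSeek V4 Flash) are never picked as the reference, because their point estimates fall below the bar.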

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
Qwen 3.6 Flash ★ Alibaba Cloud (DashScope)	8.42 / 10 CI [8.00, 8.85]	MEDIUM	$6.68	best value	batch
Gemini 3.5 Flash best Gemini	9.30 / 10 CI [8.98, 9.62]	MEDIUM	$8.77	1.3x	batch
Qwen 3.7 Plus Alibaba Cloud (DashScope)	8.99 / 10 CI [8.65, 9.32]	MEDIUM	$12.62	1.9x	batch
GPT-5.6 Terra OpenAI	9.01 / 10 CI [8.54, 9.48]	MEDIUM	$23.76	3.6x	batch
GPT-5.5 OpenAI	8.45 / 10 CI [8.19, 8.71]	HIGH	$28.70	4.3x	batch
Grok 4.5 xAI	8.65 / 10 CI [8.46, 8.84]	RANKED	$32.57	4.9x	batch
Meta Muse Spark 1.1 Meta	8.95 / 10 CI [8.51, 9.38]	MEDIUM	$33.81	5.1x	batch
GPT-5.6 Sol OpenAI	9.27 / 10 CI [8.89, 9.65]	MEDIUM	$49.53	7.4x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), i.e. the best-value good-enough option. For example, "16x" means you pay 16× that reference cost for no quality benefit above the bar.

Typical call shape for this task: 17417 input tokens → 749 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
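
As a rough illustration of how an all-in figure of this kind could be assembled, the sketch below sums per-run token spend across the usage components listed above, adds billed-failure spend, and applies an overhead multiplier. The field names, the per-million-token rates, and the overhead factor are all hypothetical placeholders; the actual accounting behind the table is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Average measured tokens per run (hypothetical field names)."""
    input_tokens: float
    output_tokens: float  # assumed to include thinking/reasoning tokens
    cache_read_tokens: float = 0.0
    cache_write_tokens: float = 0.0


@dataclass
class Rates:
    """List prices in $ per 1M tokens (placeholders, not real prices)."""
    input: float
    output: float
    cache_read: float = 0.0
    cache_write: float = 0.0


def cost_per_1k_runs(usage, rates, failure_spend_per_run=0.0, overhead=1.0):
    """All-in $ per 1,000 runs under the assumptions stated above."""
    per_run = (
        usage.input_tokens * rates.input
        + usage.output_tokens * rates.output
        + usage.cache_read_tokens * rates.cache_read
        + usage.cache_write_tokens * rates.cache_write
    ) / 1_000_000
    # Billed failures add spend without producing a usable result.
    return 1000 * (per_run + failure_spend_per_run) * overhead


# Typical call shape from this page; rates and overhead are made up.
usage = Usage(input_tokens=17417, output_tokens=749)
rates = Rates(input=0.10, output=0.40)
total = cost_per_1k_runs(usage, rates, overhead=1.05)
print(f"${total:.2f} per 1k runs")
```

Because each model's measured output and reasoning token counts differ, two models with identical list rates can land at very different per-run costs in a calculation like this.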

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

GENERIC_RELEVANCE_SCORE_SYSTEM_PROMPT + GENERIC_RELEVANCE_SCORE_USER_PROMPT (5540 calls in window)

System prompt

You are a relevance scorer. Your task is to evaluate how relevant each item is to a given context.

For each item, assign a relevance_score between 0.0 and 1.0:
- 1.0 = Highly relevant: directly and substantially addresses the context
- 0.7-0.9 = Relevant: clearly within scope, strong connection to the context
- 0.4-0.6 = Tangentially relevant: some connection but not a primary match
- 0.1-0.3 = Low relevance: weak or incidental connection
- 0.0 = Irrelevant: no meaningful connection

Scoring guidelines:
- Focus on topical and semantic alignment between each item and the context
- A high-quality item on an unrelated subject should still score low
- A brief item on a directly relevant subject should still score high
- Consider both explicit and implicit connections

Output rules:
- Return ALL items provided, one entry per item
- Each entry has exactly two fields: "id" (string, copied verbatim from the input) and "relevance_score" (float between 0.0 and 1.0)
- Wrap the entries in a single top-level key named "items"
- Do not invent extra fields (no "score", no "summary", no "reasoning", no "label")
- Do not rename the top-level key (it must be "items", not "analysis_batch", "results", "posts", etc.)

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

## Scoring Context

{scoring_context}

---

## Items to Score

{items_list}

---

## Instructions

Score each item above for relevance to the scoring context.

Output exactly this shape and nothing more:
- Top-level key: "items" (an array, one entry per input item)
- Each entry has fields:
  - id: the exact id from the input
  - relevance_score: float between 0.0 and 1.0

Do not add any other fields. Do not rename the top-level key. Do not omit any item.

## Output Format

The required JSON output schema is provided in the system prompt.
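
The output rules in this prompt pair can be checked mechanically before scores are used downstream. The following is a minimal sketch based only on the rules stated in the prompts (the actual schema substituted for `{schema_json_string}` is not shown here); `validate_scores` and its arguments are hypothetical names.

```python
import json


def validate_scores(raw_text, expected_ids):
    """Return {id: relevance_score}, or raise ValueError on any rule violation."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    # Top-level key must be "items" and nothing else.
    if not isinstance(data, dict) or set(data) != {"items"}:
        raise ValueError('top-level object must have exactly one key, "items"')
    if not isinstance(data["items"], list):
        raise ValueError('"items" must be an array')

    scores = {}
    for entry in data["items"]:
        # Exactly two fields per entry: no "score", "summary", "reasoning", etc.
        if not isinstance(entry, dict) or set(entry) != {"id", "relevance_score"}:
            raise ValueError(f"unexpected entry shape: {entry!r}")
        item_id, score = entry["id"], entry["relevance_score"]
        if not isinstance(item_id, str):
            raise ValueError(f"id must be a string: {item_id!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"relevance_score must be a number: {score!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"relevance_score out of range: {score!r}")
        if item_id in scores:
            raise ValueError(f"duplicate id: {item_id!r}")
        scores[item_id] = float(score)

    # Every input item must come back, and no ids may be invented.
    missing = set(expected_ids) - set(scores)
    extra = set(scores) - set(expected_ids)
    if missing or extra:
        raise ValueError(f"missing ids: {sorted(missing)}, unknown ids: {sorted(extra)}")
    return scores


raw = '{"items": [{"id": "a1", "relevance_score": 0.8}, {"id": "b2", "relevance_score": 0.0}]}'
print(validate_scores(raw, ["a1", "b2"]))
```

A caller might treat a `ValueError` here as the trigger for a retry or a repair pass instead of silently dropping items.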

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (126 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
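
One plausible way to wire this repair pair into a pipeline is as a fallback that runs only when the first response fails to parse. The sketch below assumes that trigger (the page does not say when the repair prompt is invoked) and takes the model call as a caller-supplied function; `parse_or_repair` and `call_repair_model` are hypothetical names.

```python
import json

# User prompt template as shown above.
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def parse_or_repair(raw_text, schema, call_repair_model):
    """Parse raw_text as JSON; on failure, ask a repair model once.

    call_repair_model is a caller-supplied function (hypothetical) that takes
    the user prompt string and returns the model's text response.
    Returns (parsed_object, was_repaired).
    """
    try:
        return json.loads(raw_text), False
    except json.JSONDecodeError:
        pass

    prompt = JSON_REPAIR_USER.format(
        schema_json=json.dumps(schema, indent=2),
        raw_text=raw_text,
    )
    repaired_text = call_repair_model(prompt)
    # If the repair output is still invalid, json.loads raises and the caller decides.
    return json.loads(repaired_text), True


# Usage with a stub in place of a real model call.
def fake_repair(prompt):
    return '{"items": []}'


result, was_repaired = parse_or_repair('{"items": [', {"type": "object"}, fake_repair)
print(result, was_repaired)
```

Returning the `was_repaired` flag lets the caller count repaired responses separately, which is useful if repair calls are billed as part of the task's cost.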

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (10 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
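
A consumer of the judge's response might validate the three fields and map the score onto the bands from the system prompt. This is a minimal sketch under that assumption; `parse_judgement` is a hypothetical helper, and scores that fall between the printed bands (for example 8.95) are assigned by lower bound, which is a choice made here rather than something the prompt specifies.

```python
import json

# (lower bound, label) pairs taken from the scoring scale in the judge system prompt.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def parse_judgement(raw_text):
    """Validate a judge response and attach the matching band label."""
    data = json.loads(raw_text)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number: {score!r}")
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score!r}")

    failure_mode = data.get("failure_mode")
    if failure_mode is not None and not isinstance(failure_mode, str):
        raise ValueError("failure_mode must be a string or null")

    # BANDS is sorted from highest to lowest bound, so the first match wins.
    band = next(label for lower, label in BANDS if score >= lower)
    return {
        "score": float(score),
        "band": band,
        "failure_mode": failure_mode,
        "rationale": data.get("rationale", ""),
    }


sample = '{"score": 8.4, "failure_mode": null, "rationale": "Minor gaps."}'
print(parse_judgement(sample))
```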