Best LLMs for Social Post Relevance Scoring

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 30796→1952 tokens

Models

Frontier on this task: Gemini 3.5 Flash at 7.89 / 10. Quality bar at 90%: 7.10.
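The bar follows from the frontier score by simple arithmetic. A minimal sketch of that calculation and of the greyed-row rule from the table legend; the function names are illustrative, not part of any published tooling, and the legend's additional MEDIUM+ confidence filter is left out for brevity:

```python
FRONTIER_SCORE = 7.89  # Gemini 3.5 Flash, from the line above
BAR_FRACTION = 0.90    # "Quality bar at 90%"


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Quality bar as a fixed fraction of the frontier score, to two decimals."""
    return round(frontier * fraction, 2)


def row_status(score: float, ci_low: float, bar: float) -> str:
    """Classify a table row against the bar.

    'clears' : even the CI low is at or above the bar
    'greyed' : the point estimate clears the bar, the CI low does not
    'below'  : the point estimate itself is under the bar
    """
    if ci_low >= bar:
        return "clears"
    if score >= bar:
        return "greyed"
    return "below"


bar = quality_bar(FRONTIER_SCORE)
print(bar)                          # 7.1
print(row_status(7.89, 7.44, bar))  # clears
print(row_status(7.21, 6.93, bar))  # greyed
print(row_status(6.10, 5.82, bar))  # below
```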

Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars sorted by blended cost; best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
Gemini 3.5 Flash	7.89 / 10	7.44	$37.48	best value
Grok 4.5	7.21 / 10	6.93	$48.70	1.3x more expensive
Claude Sonnet 4.6	5.50 / 10	5.24	$51.25	1.4x more expensive
Qwen 3.6 Plus	5.68 / 10	5.41	$31.40	16% cheaper
GPT-5.5	5.86 / 10	5.64	$94.32	2.5x more expensive
GPT-5.4 Nano	3.21 / 10	2.83	$4.51	88% cheaper
Gemini 3.1 Pro Preview	5.46 / 10	5.19	$29.68	21% cheaper
Kimi K2.6	5.77 / 10	5.54	$46.33	1.2x more expensive
GPT-5.4 Mini	4.16 / 10	3.85	$7.52	80% cheaper
DeepSeek V4 Flash	5.24 / 10	4.95	$4.05	89% cheaper
Qwen 3.5 Flash	5.08 / 10	4.81	$6.18	84% cheaper
Gemini 3.1 Flash Lite	4.76 / 10	4.46	$7.44	80% cheaper
DeepSeek V4 Pro	5.43 / 10	5.19	$11.91	68% cheaper
MiniMax M3	6.10 / 10	5.82	$8.60	77% cheaper
Claude Haiku 4.5	4.87 / 10	4.60	$15.75	58% cheaper
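The "vs best value" column can be reproduced from the cost column alone. The sketch below assumes a rounding rule (one decimal for ratios above 1, whole percent for savings) that happens to match every row in the table; the actual formatting code is not shown on this page, so treat `vs_best_value` as a hypothetical helper:

```python
BEST_VALUE_COST = 37.48  # Gemini 3.5 Flash, $ per 1k runs


def vs_best_value(cost: float, best_cost: float = BEST_VALUE_COST) -> str:
    """Format a cost relative to the best-value model's cost."""
    if cost == best_cost:
        return "best value"
    ratio = cost / best_cost
    if ratio > 1:
        return f"{ratio:.1f}x more expensive"
    return f"{round((1 - ratio) * 100)}% cheaper"


print(vs_best_value(48.70))  # 1.3x more expensive
print(vs_best_value(31.40))  # 16% cheaper
print(vs_best_value(4.05))   # 89% cheaper
```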

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
Gemini 3.5 Flash ★ best	Gemini	7.89 / 10	[7.44, 8.33]	MEDIUM	$37.48	best value	batch
Grok 4.5	xAI	7.21 / 10	[6.93, 7.48]	HIGH	$48.70	1.3x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), the best-value good-enough option. "16x" means you pay 16× that reference for no quality benefit above the bar.

Typical call shape for this task: 30796 input tokens → 1952 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
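As a rough illustration of the overpay definition, the sketch below picks the cheapest model whose point estimate clears the bar and divides each cost by that reference. The `Row` type and function names are assumptions made for this example; the figures are copied from the tables above.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    model: str
    score: float        # quality point estimate, out of 10
    cost_per_1k: float  # observed $ per 1,000 task runs


def best_value(rows: Sequence[Row], bar: float) -> Optional[Row]:
    """Cheapest row whose point estimate clears the quality bar, if any."""
    eligible = [r for r in rows if r.score >= bar]
    return min(eligible, key=lambda r: r.cost_per_1k) if eligible else None


def overpay(row: Row, reference: Row) -> float:
    """Cost multiple relative to the best-value reference."""
    return row.cost_per_1k / reference.cost_per_1k


ROWS = [
    Row("Gemini 3.5 Flash", 7.89, 37.48),
    Row("Grok 4.5", 7.21, 48.70),
    Row("MiniMax M3", 6.10, 8.60),         # cheaper, but under the bar
    Row("DeepSeek V4 Flash", 5.24, 4.05),  # cheapest, but under the bar
]

reference = best_value(ROWS, bar=7.10)
print(reference.model)                         # Gemini 3.5 Flash
print(f"{overpay(ROWS[1], reference):.1f}x")   # 1.3x
```

Cheaper models that miss the bar never become the reference, which is why the far cheaper DeepSeek V4 Flash does not displace Gemini 3.5 Flash here.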

Evaluation rubric

The output under evaluation is a relevance assessment of retrieved posts/
articles against an editorial scoring context (the analysis brief): per item,
an id and a relevance score, optionally with brief reasoning.

Judge ONLY the correctness of the relevance assessments themselves. For each
assessed item the question is: is this relevance score a defensible reflection
of how well that item's content matches the scoring context?

Correctness checks:
- Clearly on-topic items must carry high scores; clearly off-topic items low
  scores. Wrong-direction assignments are the primary quality failure.
- Scores should discriminate between items that visibly differ in topical
  fit. A run of identical scores across plainly-different items is degenerate
  output, not calibrated judgement.
- Every assessed id must exist in the input — hallucinated ids are a failure.
- Each item must receive ONE unambiguous verdict. The same item assessed as
  both relevant and irrelevant is an assessment defect — penalize it.

OUT OF SCOPE — do not reward or penalize:
- Output structure, schema, field naming, JSON shape, or formatting.
  Structural deviations are repaired by downstream validation before the
  output is consumed; a non-conforming layout carrying correct relevance
  judgments is still good output. Grade content only.
- Response completeness. If some input items are missing from the output,
  grade only the entries that are present, each on its own merit.
- Verbosity, phrasing, or style of optional reasoning text; decimal
  precision; field ordering.

Penalize:
- Demonstrably wrong assignments (high score on clearly off-topic content,
  low score on clearly on-topic content).
- Degenerate batches: all-zero, all-one, or all-identical scores when the
  input items visibly differ in topical fit.
- Hallucinated ids, or contradictory duplicate verdicts for the same item.
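Some of these checks are mechanical and can be pre-screened before a judge looks at topical fit. The following is a minimal sketch under stated assumptions: assessments arrive as `(id, score)` pairs, a verdict is "relevant" when the score meets a threshold, and `audit_assessments` is a hypothetical helper, not the evaluator actually used. Wrong-direction scores still need a judge, and the degenerate flag only matters when the inputs visibly differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def audit_assessments(
    input_ids: Iterable[str],
    assessments: List[Tuple[str, float]],
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Flag hallucinated ids, contradictory duplicates, and identical scores."""
    known = set(input_ids)

    # Ids that were assessed but never appeared in the input.
    hallucinated = sorted({i for i, _ in assessments if i not in known})

    # The same id judged both relevant and irrelevant.
    verdicts = defaultdict(set)
    for item_id, score in assessments:
        verdicts[item_id].add(score >= threshold)
    contradictory = sorted(i for i, v in verdicts.items() if len(v) > 1)

    # Every score identical across a multi-item batch.
    scores = [s for _, s in assessments]
    degenerate = len(scores) > 1 and len(set(scores)) == 1

    return {
        "hallucinated": hallucinated,
        "contradictory": contradictory,
        "degenerate": degenerate,
    }


report = audit_assessments(
    ["a", "b", "c"],
    [("a", 0.9), ("a", 0.1), ("z", 0.4), ("b", 0.9)],
)
print(report)
# {'hallucinated': ['z'], 'contradictory': ['a'], 'degenerate': False}
```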

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

POST_RELEVANCE_SYSTEM_PROMPT + POST_RELEVANCE_USER_PROMPT (10587 calls in window)

System prompt

You are an expert content analyst specializing in relevance assessment for social media posts and other short-form content. Your task is to analyze a batch of posts and determine each item's relevance to a specific search query. The exact content type and the classification threshold are provided in the user message.

## Your Role:
- Evaluate each post's relevance to the given query with precision and consistency
- Assign accurate relevance scores on a 0.0 to 1.0 scale
- Focus on semantic relevance, not just keyword matching

## Relevance Scoring Guidelines:
**1.0 (Perfect Match)**:
- Directly addresses the exact query topic with substantial detail
- Contains primary keywords and concepts from the query
- Provides valuable information directly related to the query

**0.8-0.9 (Highly Relevant)**:
- Strongly related to the query topic
- Contains most key concepts from the query
- Provides useful information on the topic

**0.6-0.7 (Moderately Relevant)**:
- Related to the query but may be tangential
- Contains some relevant keywords or concepts
- Provides some useful context or related information

**0.4-0.5 (Somewhat Relevant)**:
- Loosely related to the query topic
- May contain peripheral concepts or broader category relevance
- Limited direct value for the specific query

**0.1-0.3 (Minimally Relevant)**:
- Very weak connection to the query
- May mention related terms but lacks substantial relevance
- Primarily off-topic with minimal connection

**0.0 (Not Relevant)**:
- No meaningful connection to the query
- Completely off-topic
- Contains no relevant keywords or concepts

## Analysis Principles:
1. **Semantic Understanding**: Look beyond exact keyword matches to understand the underlying meaning and context
2. **Query Intent**: Consider what the searcher is likely looking for based on the query
3. **Content Quality**: Factor in whether the post provides meaningful information related to the query
4. **Context Awareness**: Consider the context in which terms are used, not just their presence
5. **Specificity**: More specific, detailed content about the query topic scores higher than vague mentions

## Classification Rule:
- Posts with relevance_score at or above the classification threshold go into "relevant_posts"
- Posts with relevance_score below the threshold go into "irrelevant_posts"
- Every input post must appear in exactly one of the two arrays

## Output Rules:
- Return a JSON object with exactly two top-level keys: "relevant_posts" and "irrelevant_posts"
- Each entry in either array has exactly two fields: "post_id" (string, copied verbatim from the input) and "relevance_score" (float between 0.0 and 1.0)
- Do not invent extra fields (no "summary", no "reasoning", no "score", no "label")
- Do not rename the top-level keys
- Do not omit any input post

## Important Notes:
- Maintain consistency across similar posts in the batch
- Be objective and focus on content relevance rather than post quality, popularity, or personal opinions
- For borderline cases, err slightly toward inclusion rather than exclusion

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Please analyze the following batch of {content_type} for relevance to the search query. For each post, determine if it's relevant to the query and provide a relevance score from 0.0 to 1.0.

## Search Query:
"{query}"

## {content_type} to Analyze:
{posts}

## Instructions:
1. Read and understand the search query and its likely intent
2. Score each post for relevance to the query using the scoring guidelines (0.0-1.0)
3. Classify each post:
   - Score ≥ {relevance_threshold} → put it in "relevant_posts"
   - Score < {relevance_threshold} → put it in "irrelevant_posts"
4. Every input post must appear in exactly one array

## Required Output Shape:
- Top-level keys: "relevant_posts" and "irrelevant_posts" (both arrays)
- Each entry has fields:
  - post_id: the exact id from the input
  - relevance_score: float between 0.0 and 1.0
- No other fields. No other top-level keys.

## Output Format:
The required JSON output schema is provided in the system prompt.
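On the consuming side, the classification rule in this prompt pair can be checked mechanically. A minimal sketch, assuming the response has already parsed as JSON with the two top-level keys; `check_post_relevance` is a hypothetical helper and not the downstream validator referred to in the rubric:

```python
import json
from typing import Iterable, List


def check_post_relevance(raw: str, input_ids: Iterable[str], threshold: float) -> List[str]:
    """Return human-readable problems found in a relevant/irrelevant split."""
    data = json.loads(raw)
    problems: List[str] = []
    seen = {}

    for key, should_be_relevant in (("relevant_posts", True), ("irrelevant_posts", False)):
        for entry in data.get(key, []):
            pid, score = entry["post_id"], entry["relevance_score"]
            if pid in seen:
                problems.append(f"{pid}: appears more than once")
            seen[pid] = score
            if not 0.0 <= score <= 1.0:
                problems.append(f"{pid}: score {score} is outside 0.0-1.0")
            if (score >= threshold) != should_be_relevant:
                problems.append(f"{pid}: score {score} is in the wrong array for threshold {threshold}")

    expected = set(input_ids)
    for pid in sorted(expected - set(seen)):
        problems.append(f"{pid}: missing from output")
    for pid in sorted(set(seen) - expected):
        problems.append(f"{pid}: not present in input")
    return problems


raw = json.dumps({
    "relevant_posts": [{"post_id": "p1", "relevance_score": 0.9}],
    "irrelevant_posts": [{"post_id": "p2", "relevance_score": 0.8}],
})
for problem in check_post_relevance(raw, ["p1", "p2", "p3"], threshold=0.5):
    print(problem)
```

With these inputs the check reports two problems: `p2` sits in the wrong array, and `p3` is missing.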

GENERIC_RELEVANCE_SCORE_SYSTEM_PROMPT + GENERIC_RELEVANCE_SCORE_USER_PROMPT (476 calls in window)

System prompt

You are a relevance scorer. Your task is to evaluate how relevant each item is to a given context.

For each item, assign a relevance_score between 0.0 and 1.0:
- 1.0 = Highly relevant: directly and substantially addresses the context
- 0.7-0.9 = Relevant: clearly within scope, strong connection to the context
- 0.4-0.6 = Tangentially relevant: some connection but not a primary match
- 0.1-0.3 = Low relevance: weak or incidental connection
- 0.0 = Irrelevant: no meaningful connection

Scoring guidelines:
- Focus on topical and semantic alignment between each item and the context
- A high-quality item on an unrelated subject should still score low
- A brief item on a directly relevant subject should still score high
- Consider both explicit and implicit connections

Output rules:
- Return ALL items provided, one entry per item
- Each entry has exactly two fields: "id" (string, copied verbatim from the input) and "relevance_score" (float between 0.0 and 1.0)
- Wrap the entries in a single top-level key named "items"
- Do not invent extra fields (no "score", no "summary", no "reasoning", no "label")
- Do not rename the top-level key (it must be "items", not "analysis_batch", "results", "posts", etc.)

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

## Scoring Context

{scoring_context}

---

## Items to Score

{items_list}

---

## Instructions

Score each item above for relevance to the scoring context.

Output exactly this shape and nothing more:
- Top-level key: "items" (an array, one entry per input item)
- Each entry has fields:
  - id: the exact id from the input
  - relevance_score: float between 0.0 and 1.0

Do not add any other fields. Do not rename the top-level key. Do not omit any item.

## Output Format

The required JSON output schema is provided in the system prompt.
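The placeholders in this pair (`{schema_json_string}`, `{scoring_context}`, `{items_list}`) could be filled along the following lines. This is a sketch only: the production schema and the item rendering format are not shown on this page, so `ITEMS_SCHEMA`, `render_items`, and the shortened `USER_TEMPLATE_HEAD` are assumptions that merely mirror the stated output rules.

```python
import json
from typing import Dict, List

# Hypothetical JSON Schema mirroring the output rules: a single "items" array
# whose entries carry exactly "id" and "relevance_score".
ITEMS_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "relevance_score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
                },
                "required": ["id", "relevance_score"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["items"],
    "additionalProperties": False,
}

# First part of the user prompt only; the instruction section is omitted here.
USER_TEMPLATE_HEAD = (
    "## Scoring Context\n\n{scoring_context}\n\n---\n\n"
    "## Items to Score\n\n{items_list}\n"
)


def render_items(items: List[Dict[str, str]]) -> str:
    """Assumed rendering: one id/text pair per item."""
    return "\n".join(f"- id: {item['id']}\n  text: {item['text']}" for item in items)


def render_user_prompt(scoring_context: str, items: List[Dict[str, str]]) -> str:
    return USER_TEMPLATE_HEAD.format(
        scoring_context=scoring_context,
        items_list=render_items(items),
    )


schema_json_string = json.dumps(ITEMS_SCHEMA, indent=2)
prompt = render_user_prompt(
    "Coverage of battery recycling policy",
    [{"id": "x1", "text": "New EU rules on lithium recovery"}],
)
print(prompt)
```

`str.format` only parses the template, so braces inside the substituted schema or item text are safe; literal braces in a template itself would need doubling.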

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (11 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
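A repair prompt like this is typically a fallback after cheaper local fixes fail. The sketch below tries a plain parse, then a parse with a wrapping code fence stripped, and only then calls the repair prompt. The `call_model` callable is a stand-in for whatever client is in use, the system prompt is abbreviated, and the ordering of the steps is an assumption rather than the documented pipeline.

```python
import json
import re
from typing import Any, Callable

JSON_REPAIR_SYSTEM = "You are a JSON repair tool. ..."  # abbreviated; full text is above
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)

FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def parse_or_repair(
    raw_text: str,
    schema: dict,
    call_model: Callable[[str, str], str],
) -> Any:
    """Parse model output locally; fall back to the repair prompt on failure."""
    candidate = raw_text.strip()
    fenced = FENCE_RE.match(candidate)
    if fenced:
        candidate = fenced.group(1)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    user = JSON_REPAIR_USER.format(schema_json=json.dumps(schema), raw_text=raw_text)
    # The repaired text may still be invalid; json.loads raises in that case.
    return json.loads(call_model(JSON_REPAIR_SYSTEM, user))


calls = []


def fake_model(system: str, user: str) -> str:
    """Test double that records the call and returns a fixed repair."""
    calls.append(user)
    return '{"items": []}'


fenced_output = FENCE + 'json\n{"items": [{"id": "a", "relevance_score": 0.7}]}\n' + FENCE
print(parse_or_repair(fenced_output, {}, fake_model))  # parsed locally, no model call
print(parse_or_repair('{"items": [', {}, fake_model))  # falls back to the repair call
```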