Best LLMs for X Post Relevance Scoring

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 74852→2323 tokens

Models

Frontier on this task: Gemini 3.5 Flash at 7.39 / 10. Quality bar (90% of the frontier score): 6.65.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
MiniMax M3	7.21 / 10	7.10	$19.00	best value
Gemini 3.5 Flash	7.39 / 10	6.98	$49.84	2.6x more expensive
Meta Muse Spark 1.1	6.97 / 10	6.55	$105.90	5.6x more expensive
Grok 4.5	6.80 / 10	6.47	$132.35	7x more expensive
DeepSeek V4 Pro	5.77 / 10	5.42	$26.74	1.4x more expensive
Qwen 3.5 Flash	5.52 / 10	5.10	$7.58	60% cheaper
Gemini 3.1 Flash Lite	4.33 / 10	3.96	$9.95	48% cheaper
Qwen 3.6 Plus	5.82 / 10	5.52	$38.46	2x more expensive
Claude Haiku 4.5	4.61 / 10	4.30	$31.00	1.6x more expensive
GPT-5.5	6.25 / 10	5.93	$171.24	9x more expensive
Kimi K2.6	6.09 / 10	5.74	$79.92	4.2x more expensive
GPT-5.4 Mini	4.66 / 10	4.34	$16.10	15% cheaper
Claude Sonnet 4.6	5.62 / 10	5.25	$93.50	4.9x more expensive
Gemini 3.1 Pro Preview	4.89 / 10	4.49	$65.57	3.5x more expensive
DeepSeek V4 Flash	5.70 / 10	5.28	$9.94	48% cheaper
GPT-5.4 Nano	2.39 / 10	1.92	$8.04	58% cheaper
Qwen 3.6 Flash	5.84 / 10	5.36	$52.50	2.8x more expensive
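
The selection logic implied by the figures above (a bar at 90% of the frontier score, with the best-value model being the cheapest one that clears it) can be sketched as follows. This is an illustrative reconstruction from the published numbers, not the leaderboard's actual code. The names `Row`, `quality_bar`, `classify`, and `best_value` are hypothetical, the use of `>=` at the bar is an assumption, and requiring CI low (rather than the point estimate) to clear the bar is inferred from the greyed-row note. Only a subset of rows is included.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float   # point estimate, out of 10
    ci_low: float  # lower bound of the confidence interval
    cost: float    # $ per 1k runs


# A subset of rows copied from the table above.
ROWS = [
    Row("MiniMax M3", 7.21, 7.10, 19.00),
    Row("Gemini 3.5 Flash", 7.39, 6.98, 49.84),
    Row("Meta Muse Spark 1.1", 6.97, 6.55, 105.90),
    Row("Grok 4.5", 6.80, 6.47, 132.35),
    Row("Qwen 3.5 Flash", 5.52, 5.10, 7.58),
]


def quality_bar(rows, fraction=0.90):
    """Bar = a fraction of the best point estimate (the frontier)."""
    return fraction * max(r.score for r in rows)


def classify(row, bar):
    """'clears' if CI low reaches the bar, 'greyed' if only the point
    estimate does, otherwise 'below'. (Assumed reading of the legend.)"""
    if row.ci_low >= bar:
        return "clears"
    if row.score >= bar:
        return "greyed"
    return "below"


def best_value(rows, bar):
    """Cheapest model whose CI low clears the bar (assumed definition)."""
    eligible = [r for r in rows if r.ci_low >= bar]
    return min(eligible, key=lambda r: r.cost)


bar = quality_bar(ROWS)
print(f"bar = {bar:.2f}")
for r in ROWS:
    print(f"{r.model}: {classify(r, bar)}")
print("best value:", best_value(ROWS, bar).model)
```

Under these assumptions, Qwen 3.5 Flash is cheaper than MiniMax M3 but is excluded because neither its point estimate nor its CI low reaches the bar.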

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
MiniMax M3 ★ MiniMax	7.21 / 10 CI [7.10, 7.33]	RANKED	$19.00	best value	batch
Gemini 3.5 Flash best Gemini	7.39 / 10 CI [6.98, 7.80]	MEDIUM	$49.84	2.6x	batch
Meta Muse Spark 1.1 Meta	6.97 / 10 CI [6.55, 7.39]	MEDIUM	$105.90	5.6x	batch
Grok 4.5 xAI	6.80 / 10 CI [6.47, 7.13]	MEDIUM	$132.35	7x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the cheapest option that is good enough. "16x" means you pay 16× that reference's cost for no quality benefit above the bar.

Typical call shape for this task: 74852 input tokens → 2323 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
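
As a rough illustration of how the "vs best value" labels relate to the cost column, the sketch below divides each model's cost by the ★ reference cost. The function name `vs_best_value` and the rounding rules (one decimal for multiples, whole percent for discounts) are assumptions chosen because they reproduce the labels in the table; the page does not state how it rounds.

```python
def vs_best_value(cost: float, reference_cost: float) -> str:
    """Label a model's cost relative to the best-value (★) model.

    Rounding is an assumption: multiples to one decimal with a
    trailing ".0" dropped, discounts to the nearest whole percent.
    """
    if cost == reference_cost:
        return "best value"
    ratio = cost / reference_cost
    if ratio > 1:
        text = f"{ratio:.1f}"
        if text.endswith(".0"):
            text = text[:-2]
        return f"{text}x more expensive"
    return f"{round((1 - ratio) * 100)}% cheaper"


REFERENCE = 19.00  # MiniMax M3, $ per 1k runs

print(vs_best_value(49.84, REFERENCE))   # Gemini 3.5 Flash
print(vs_best_value(132.35, REFERENCE))  # Grok 4.5
print(vs_best_value(7.58, REFERENCE))    # Qwen 3.5 Flash
```

Note that a "cheaper" label says nothing about quality: the cheaper models in the table all fall below the quality bar.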

Evaluation rubric

The output under evaluation is a batch of {"id", "relevance_score"}
entries scoring how well X.com (Twitter) posts match a given scoring
context (a topic name, a topic description, and a topic report summary).

Judge ONLY the correctness of the relevance assignments that were
delivered. For each entry present in the output, the question is: is
this relevance_score a defensible reflection of how well that post's
content matches the scoring context?

Apply these correctness checks:
- Clearly on-topic posts should carry high scores; clearly off-topic
  posts should carry low scores. Wrong-direction assignments are the
  primary quality failure.
- Scores should discriminate between posts that visibly differ in
  topical fit. A run of identical scores across plainly-different posts
  is degenerate output, not calibrated judgement.
- Score magnitudes should be internally consistent within the batch —
  posts the model rated similarly should be similarly relevant on
  inspection.

Do NOT consider response completeness. Whether the output covers every
input post is OUT OF SCOPE for this evaluation. The production pipeline
reschedules any unscored items to a different model, so partial output
is a routing outcome — not a quality defect. Grade only the entries
that ARE present, each on its own merit; treat missing entries as if
they never belonged to this batch.

Also ignore cosmetic details that don't change the assignment's
correctness: decimal precision, JSON field ordering, whitespace,
phrasing of any optional reasoning field.

Penalise:
- Assignments that are demonstrably wrong (high score on a clearly
  off-topic post, low score on a clearly on-topic post).
- Degenerate batches: all-zero, all-one, or all-identical scores when
  the input posts visibly differ in topical fit.
- Outputs where the score-to-content relationship is random or absent.

Reward:
- Correct directionality and meaningful discrimination across the batch,
  even when the batch is small.
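
The "degenerate batch" condition in the rubric lends itself to a cheap mechanical pre-check before any judge is involved. The sketch below flags batches whose scores are all identical; the function name `is_degenerate` and the tolerance are hypothetical. It can only flag candidates: whether the input posts visibly differ in topical fit, which the rubric requires before penalising, still needs inspection of the posts themselves.

```python
def is_degenerate(entries, tolerance=1e-9):
    """Return True if a batch of two or more entries has
    effectively identical relevance scores."""
    scores = [e["relevance_score"] for e in entries]
    if len(scores) < 2:
        return False  # a single entry cannot show discrimination either way
    return max(scores) - min(scores) <= tolerance


flat = [{"id": "a", "relevance_score": 0.5}, {"id": "b", "relevance_score": 0.5}]
varied = [{"id": "a", "relevance_score": 0.9}, {"id": "b", "relevance_score": 0.1}]

print(is_degenerate(flat))
print(is_degenerate(varied))
```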

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

GENERIC_RELEVANCE_SCORE_SYSTEM_PROMPT + GENERIC_RELEVANCE_SCORE_USER_PROMPT (1782 calls in window)

System prompt

You are a relevance scorer. Your task is to evaluate how relevant each item is to a given context.

For each item, assign a relevance_score between 0.0 and 1.0:
- 1.0 = Highly relevant: directly and substantially addresses the context
- 0.7-0.9 = Relevant: clearly within scope, strong connection to the context
- 0.4-0.6 = Tangentially relevant: some connection but not a primary match
- 0.1-0.3 = Low relevance: weak or incidental connection
- 0.0 = Irrelevant: no meaningful connection

Scoring guidelines:
- Focus on topical and semantic alignment between each item and the context
- A high-quality item on an unrelated subject should still score low
- A brief item on a directly relevant subject should still score high
- Consider both explicit and implicit connections

Output rules:
- Return ALL items provided, one entry per item
- Each entry has exactly two fields: "id" (string, copied verbatim from the input) and "relevance_score" (float between 0.0 and 1.0)
- Wrap the entries in a single top-level key named "items"
- Do not invent extra fields (no "score", no "summary", no "reasoning", no "label")
- Do not rename the top-level key (it must be "items", not "analysis_batch", "results", "posts", etc.)

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

## Scoring Context

{scoring_context}

---

## Items to Score

{items_list}

---

## Instructions

Score each item above for relevance to the scoring context.

Output exactly this shape and nothing more:
- Top-level key: "items" (an array, one entry per input item)
- Each entry has fields:
  - id: the exact id from the input
  - relevance_score: float between 0.0 and 1.0

Do not add any other fields. Do not rename the top-level key. Do not omit any item.

## Output Format

The required JSON output schema is provided in the system prompt.
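
A caller consuming this prompt family would typically check the response against the output rules before trusting the scores. The following is a minimal sketch of such a check, assuming the response arrives as a raw JSON string and the caller knows the ids it sent. The function name `validate_items_response` and its return shape are hypothetical and not part of the production pipeline; ids that come back in `missing` are the ones a pipeline could reschedule.

```python
import json


def validate_items_response(raw: str, expected_ids: list[str]):
    """Return (valid_entries, missing_ids, errors) for a response
    that should look like {"items": [{"id", "relevance_score"}, ...]}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], list(expected_ids), [f"invalid JSON: {exc}"]

    if (not isinstance(data, dict) or set(data) != {"items"}
            or not isinstance(data["items"], list)):
        return [], list(expected_ids), ['top level must be {"items": [...]}']

    expected, seen = set(expected_ids), set()
    valid, errors = [], []
    for entry in data["items"]:
        # Exactly two fields are allowed: "id" and "relevance_score".
        if not isinstance(entry, dict) or set(entry) != {"id", "relevance_score"}:
            errors.append(f"unexpected entry shape: {entry!r}")
            continue
        item_id, score = entry["id"], entry["relevance_score"]
        if not isinstance(item_id, str) or item_id not in expected or item_id in seen:
            errors.append(f"unknown or duplicate id: {item_id!r}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if (isinstance(score, bool) or not isinstance(score, (int, float))
                or not 0.0 <= score <= 1.0):
            errors.append(f"score out of range for {item_id!r}: {score!r}")
            continue
        seen.add(item_id)
        valid.append(entry)

    missing = [i for i in expected_ids if i not in seen]
    return valid, missing, errors


raw = ('{"items": [{"id": "a", "relevance_score": 0.9},'
       ' {"id": "b", "relevance_score": 1.5},'
       ' {"id": "c", "relevance_score": 0.2, "reasoning": "x"}]}')
valid, missing, errors = validate_items_response(raw, ["a", "b", "c", "d"])
print(valid, missing, len(errors))
```

In this sketch an entry that fails validation is treated the same as one that was never returned, so its id also appears in `missing`.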

POST_RELEVANCE_SYSTEM_PROMPT + POST_RELEVANCE_USER_PROMPT (624 calls in window)

System prompt

You are an expert content analyst specializing in relevance assessment for social media posts and other short-form content. Your task is to analyze a batch of posts and determine each item's relevance to a specific search query. The exact content type and the classification threshold are provided in the user message.

## Your Role:
- Evaluate each post's relevance to the given query with precision and consistency
- Assign accurate relevance scores on a 0.0 to 1.0 scale
- Focus on semantic relevance, not just keyword matching

## Relevance Scoring Guidelines:
**1.0 (Perfect Match)**:
- Directly addresses the exact query topic with substantial detail
- Contains primary keywords and concepts from the query
- Provides valuable information directly related to the query

**0.8-0.9 (Highly Relevant)**:
- Strongly related to the query topic
- Contains most key concepts from the query
- Provides useful information on the topic

**0.6-0.7 (Moderately Relevant)**:
- Related to the query but may be tangential
- Contains some relevant keywords or concepts
- Provides some useful context or related information

**0.4-0.5 (Somewhat Relevant)**:
- Loosely related to the query topic
- May contain peripheral concepts or broader category relevance
- Limited direct value for the specific query

**0.1-0.3 (Minimally Relevant)**:
- Very weak connection to the query
- May mention related terms but lacks substantial relevance
- Primarily off-topic with minimal connection

**0.0 (Not Relevant)**:
- No meaningful connection to the query
- Completely off-topic
- Contains no relevant keywords or concepts

## Analysis Principles:
1. **Semantic Understanding**: Look beyond exact keyword matches to understand the underlying meaning and context
2. **Query Intent**: Consider what the searcher is likely looking for based on the query
3. **Content Quality**: Factor in whether the post provides meaningful information related to the query
4. **Context Awareness**: Consider the context in which terms are used, not just their presence
5. **Specificity**: More specific, detailed content about the query topic scores higher than vague mentions

## Classification Rule:
- Posts with relevance_score at or above the classification threshold go into "relevant_posts"
- Posts with relevance_score below the threshold go into "irrelevant_posts"
- Every input post must appear in exactly one of the two arrays

## Output Rules:
- Return a JSON object with exactly two top-level keys: "relevant_posts" and "irrelevant_posts"
- Each entry in either array has exactly two fields: "post_id" (string, copied verbatim from the input) and "relevance_score" (float between 0.0 and 1.0)
- Do not invent extra fields (no "summary", no "reasoning", no "score", no "label")
- Do not rename the top-level keys
- Do not omit any input post

## Important Notes:
- Maintain consistency across similar posts in the batch
- Be objective and focus on content relevance rather than post quality, popularity, or personal opinions
- For borderline cases, err slightly toward inclusion rather than exclusion

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Please analyze the following batch of {content_type} for relevance to the search query. For each post, determine if it's relevant to the query and provide a relevance score from 0.0 to 1.0.

## Search Query:
"{query}"

## {content_type} to Analyze:
{posts}

## Instructions:
1. Read and understand the search query and its likely intent
2. Score each post for relevance to the query using the scoring guidelines (0.0-1.0)
3. Classify each post:
   - Score ≥ {relevance_threshold} → put it in "relevant_posts"
   - Score < {relevance_threshold} → put it in "irrelevant_posts"
4. Every input post must appear in exactly one array

## Required Output Shape:
- Top-level keys: "relevant_posts" and "irrelevant_posts" (both arrays)
- Each entry has fields:
  - post_id: the exact id from the input
  - relevance_score: float between 0.0 and 1.0
- No other fields. No other top-level keys.

## Output Format:
The required JSON output schema is provided in the system prompt.
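
Because this prompt family asks the model both to score and to classify, the two can disagree. The sketch below, under the assumption that the response has already been parsed into a dict, reports posts placed in the wrong array for a given threshold and posts that appear more than once. The function name `check_partition` is hypothetical.

```python
def check_partition(response: dict, threshold: float):
    """Return (misplaced_ids, duplicated_ids).

    A post is misplaced if it sits in "relevant_posts" with a score
    below the threshold, or in "irrelevant_posts" with a score at or
    above it. A post is duplicated if its post_id appears twice.
    """
    misplaced, duplicated, seen = [], [], set()
    arrays = (("relevant_posts", True), ("irrelevant_posts", False))
    for key, should_be_relevant in arrays:
        for post in response.get(key, []):
            post_id = post["post_id"]
            if post_id in seen:
                duplicated.append(post_id)
            seen.add(post_id)
            if (post["relevance_score"] >= threshold) != should_be_relevant:
                misplaced.append(post_id)
    return misplaced, duplicated


response = {
    "relevant_posts": [
        {"post_id": "1", "relevance_score": 0.8},
        {"post_id": "2", "relevance_score": 0.3},
    ],
    "irrelevant_posts": [
        {"post_id": "3", "relevance_score": 0.1},
        {"post_id": "4", "relevance_score": 0.5},
        {"post_id": "1", "relevance_score": 0.1},
    ],
}
print(check_partition(response, threshold=0.5))
```

A caller could equally ignore the model's arrays and re-derive the split from the scores alone; the check above is useful mainly as a signal that the model did not follow the classification rule.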