Best LLMs for X Post Selection

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 1575→1102 tokens

Models

Frontier on this task: GPT-5.5 at 8.75 / 10. Quality bar at 90%: 7.87.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
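The bar and the greyed-row rule can be restated as a small check. This is a minimal sketch in Python, assuming the bar is simply 90% of the frontier score; the function name `bar_status` and its three labels are illustrative and do not come from the leaderboard. Note that 0.9 × 8.75 is 7.875, so the displayed 7.87 presumably reflects truncation or an unrounded frontier score.

```python
FRONTIER_SCORE = 8.75  # GPT-5.5 point estimate, as displayed
QUALITY_BAR = 7.87     # displayed bar (stated as 90% of frontier)


def bar_status(point: float, ci_low: float, bar: float = QUALITY_BAR) -> str:
    """Classify a model against the quality bar (labels are hypothetical)."""
    if ci_low >= bar:
        return "clears"   # even the pessimistic estimate clears the bar
    if point >= bar:
        return "greyed"   # point estimate clears, CI low does not
    return "below"        # point estimate itself misses the bar


# (point estimate, CI low) pairs copied from the table above
rows = {
    "DeepSeek V4 Flash": (8.72, 8.58),
    "GPT-5.4 Mini": (8.12, 7.82),
    "Claude Haiku 4.5": (8.04, 7.64),
    "GPT-5.4 Nano": (7.68, 7.31),
}

for name, (point, ci_low) in rows.items():
    print(f"{name}: {bar_status(point, ci_low)}")
```

Under this reading, GPT-5.4 Mini and Claude Haiku 4.5 are the greyed rows, and GPT-5.4 Nano misses the bar outright, which would explain why it is listed last despite its low cost.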

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	8.72 / 10	8.58	$0.20	best value
Gemini 3.1 Flash Lite	8.47 / 10	8.21	$0.31	1.6x more expensive
GPT-5.4 Mini	8.12 / 10	7.82	$0.42	2.1x more expensive
Qwen 3.5 Flash	8.51 / 10	8.29	$0.53	2.7x more expensive
MiniMax M3	8.36 / 10	8.28	$0.92	4.7x more expensive
DeepSeek V4 Pro	8.40 / 10	8.14	$0.97	4.9x more expensive
GPT-5.6 Luna	8.41 / 10	8.19	$1.02	5.1x more expensive
Tencent Hy3	8.11 / 10	7.89	$1.20	6x more expensive
Qwen 3.6 Plus	8.50 / 10	8.30	$1.49	7.5x more expensive
Claude Haiku 4.5	8.04 / 10	7.64	$1.51	7.6x more expensive
GPT-5.6 Terra	8.58 / 10	8.40	$2.18	11x more expensive
Gemini 3.1 Pro Preview	8.51 / 10	8.28	$2.32	12x more expensive
Kimi K2.6	8.65 / 10	8.46	$3.36	17x more expensive
Claude Sonnet 5	8.33 / 10	8.17	$3.94	20x more expensive
Claude Sonnet 4.6	8.71 / 10	8.51	$4.52	23x more expensive
Qwen 3.7 Plus	8.35 / 10	8.18	$4.81	24x more expensive
GPT-5.5	8.75 / 10	8.60	$4.89	25x more expensive
GPT-5.6 Sol	8.47 / 10	8.27	$5.44	27x more expensive
Qwen 3.6 Flash	8.59 / 10	8.41	$6.52	33x more expensive
Gemini 3.5 Flash	8.48 / 10	8.38	$7.39	37x more expensive
Grok 4.5	8.32 / 10	8.15	$17.71	89x more expensive
Meta Muse Spark 1.1	8.38 / 10	8.16	$19.20	97x more expensive
GPT-5.4 Nano	7.68 / 10	7.31	$0.22	1.1x more expensive

Cost breakdown

Model	Provider	Quality	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★	DeepSeek	8.72 / 10 CI [8.58, 8.86]	RANKED	$0.20	best value	batch
Gemini 3.1 Flash Lite	Gemini	8.47 / 10 CI [8.21, 8.73]	HIGH	$0.31	1.6x	batch
GPT-5.4 Mini	OpenAI	8.12 / 10 CI [7.82, 8.43]	MEDIUM	$0.42	2.1x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	8.51 / 10 CI [8.29, 8.73]	HIGH	$0.53	2.7x	batch
MiniMax M3	MiniMax	8.36 / 10 CI [8.28, 8.43]	RANKED	$0.92	4.7x	batch
DeepSeek V4 Pro	DeepSeek	8.40 / 10 CI [8.14, 8.67]	HIGH	$0.97	4.9x	batch
GPT-5.6 Luna	OpenAI	8.41 / 10 CI [8.19, 8.62]	HIGH	$1.02	5.1x	batch
Tencent Hy3	OpenRouter	8.11 / 10 CI [7.89, 8.33]	HIGH	$1.20	6x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.50 / 10 CI [8.30, 8.69]	RANKED	$1.49	7.5x	batch
Claude Haiku 4.5	Anthropic	8.04 / 10 CI [7.64, 8.45]	MEDIUM	$1.51	7.6x	batch
GPT-5.6 Terra	OpenAI	8.58 / 10 CI [8.40, 8.76]	RANKED	$2.18	11x	batch
Gemini 3.1 Pro Preview	Gemini	8.51 / 10 CI [8.28, 8.74]	HIGH	$2.32	12x	batch
Kimi K2.6	Moonshot AI	8.65 / 10 CI [8.46, 8.84]	RANKED	$3.36	17x	batch
Claude Sonnet 5	Anthropic	8.33 / 10 CI [8.17, 8.50]	RANKED	$3.94	20x	batch
Claude Sonnet 4.6	Anthropic	8.71 / 10 CI [8.51, 8.91]	HIGH	$4.52	23x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.35 / 10 CI [8.18, 8.51]	RANKED	$4.81	24x	batch
GPT-5.5 (best)	OpenAI	8.75 / 10 CI [8.60, 8.89]	RANKED	$4.89	25x	batch
GPT-5.6 Sol	OpenAI	8.47 / 10 CI [8.27, 8.67]	RANKED	$5.44	27x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	8.59 / 10 CI [8.41, 8.76]	RANKED	$6.52	33x	batch
Gemini 3.5 Flash	Gemini	8.48 / 10 CI [8.38, 8.57]	RANKED	$7.39	37x	batch
Grok 4.5	xAI	8.32 / 10 CI [8.15, 8.49]	RANKED	$17.71	89x	batch
Meta Muse Spark 1.1	Meta	8.38 / 10 CI [8.16, 8.60]	HIGH	$19.20	97x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the best-value good-enough option. "16x" means you pay 16× that reference for no quality benefit above the bar.

Typical call shape for this task: 1575 input tokens → 1102 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. This figure is not comparable to providers' advertised $/1M list rates: it is what running the task costs, not a per-token price.
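The overpay multiplier reads as a plain cost ratio against the ★ model. The sketch below is in Python and assumes exactly that definition; `overpay_ratio` is a hypothetical helper, not the page's own code. Ratios computed from the rounded dollar figures can differ slightly from the displayed multipliers (for example, 4.89 / 0.20 is 24.45 while the table shows 25x), which suggests the page divides unrounded costs.

```python
def overpay_ratio(cost_per_1k: float, best_value_cost_per_1k: float) -> float:
    """Return how many times more a model costs than the best-value reference."""
    if best_value_cost_per_1k <= 0:
        raise ValueError("reference cost must be positive")
    return cost_per_1k / best_value_cost_per_1k


BEST_VALUE_COST = 0.20  # DeepSeek V4 Flash, $ per 1k runs, as displayed (rounded)

# Displayed costs copied from the table; the results are approximations
# because the inputs are already rounded to cents.
for name, cost in [
    ("Gemini 3.1 Flash Lite", 0.31),
    ("GPT-5.5", 4.89),
    ("Meta Muse Spark 1.1", 19.20),
]:
    print(f"{name}: about {overpay_ratio(cost, BEST_VALUE_COST):.2f}x")
```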

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

X_POST_SELECTION_SYSTEM_PROMPT + X_POST_SELECTION_USER_PROMPT (2049 calls in window)

System prompt

You are a social media strategist selecting the best X (Twitter) posts to publish within a daily budget.

Evaluate each candidate post and select exactly the requested number of posts (specified in the user message) that will maximize overall engagement and audience value.

Selection criteria (in order of importance):
1. Engagement potential — posts likely to generate clicks, replies, retweets
2. Topic diversity — avoid selecting multiple posts about the same topic
3. Content quality — clear, compelling, well-written posts
4. Timeliness — prefer posts about recent or trending topics

Return only the IDs of the selected posts in the specified JSON format.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Select the best {select_count} posts from the {total_count} candidates below.

Candidates:
{candidates_text}

The required JSON output schema is provided in the system prompt.
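A caller might fill the user template and check the model's reply along the following lines. This is a sketch in Python under stated assumptions: the page does not show the schema or the layout of `{candidates_text}`, so the `selected_ids` field name, the `[id] text` candidate format, and both function names are hypothetical.

```python
import json

USER_TEMPLATE = (
    "Select the best {select_count} posts from the {total_count} candidates below.\n\n"
    "Candidates:\n{candidates_text}\n\n"
    "The required JSON output schema is provided in the system prompt."
)


def build_user_prompt(candidates: dict[str, str], select_count: int) -> str:
    """Fill the user template; one '[id] text' line per post is an assumed layout."""
    candidates_text = "\n".join(f"[{pid}] {text}" for pid, text in candidates.items())
    return USER_TEMPLATE.format(
        select_count=select_count,
        total_count=len(candidates),
        candidates_text=candidates_text,
    )


def validate_selection(raw: str, candidates: dict[str, str], select_count: int) -> list[str]:
    """Check the reply against the prompt's rules; 'selected_ids' is a hypothetical field."""
    data = json.loads(raw)
    ids = data.get("selected_ids") if isinstance(data, dict) else None
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("expected a JSON object with a list of string IDs")
    if len(ids) != select_count:
        raise ValueError(f"expected exactly {select_count} IDs, got {len(ids)}")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate IDs in selection")
    unknown = [i for i in ids if i not in candidates]
    if unknown:
        raise ValueError(f"IDs not among the candidates: {unknown}")
    return ids
```

The count check mirrors the system prompt's instruction to select exactly the requested number of posts; the duplicate and unknown-ID checks are extra guards that the prompt implies but does not spell out.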

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (1 call in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
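The call counts (1 versus 2049) suggest the repair pair runs only as a fallback when the primary output fails to parse; that is an inference, not something the page states. A minimal Python sketch of such a fallback follows. `parse_or_repair` and the `repair_call` parameter are hypothetical: `repair_call` stands in for whatever client sends the repair system prompt plus the filled user prompt to a model and returns its text.

```python
import json
from typing import Callable

REPAIR_USER_TEMPLATE = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def parse_or_repair(
    raw_text: str,
    schema_json: str,
    repair_call: Callable[[str], str],
) -> dict:
    """Parse model output as a JSON object, asking for one repair pass on failure."""
    try:
        parsed = json.loads(raw_text)
        if isinstance(parsed, dict):
            return parsed  # already a valid JSON object, no repair needed
    except json.JSONDecodeError:
        pass  # fall through to the repair call

    repair_prompt = REPAIR_USER_TEMPLATE.format(
        schema_json=schema_json, raw_text=raw_text
    )
    repaired = json.loads(repair_call(repair_prompt))
    if not isinstance(repaired, dict):
        raise ValueError("repair did not return a JSON object")
    return repaired
```

This sketch only checks that the result is a JSON object; validating it against the schema itself would need a separate step.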