Best LLMs for Topic Cluster Naming

Category: Topic Organization & Clustering · Rail: absolute · Typical I/O: 11564→720 tokens

Models

Frontier on this task: DeepSeek V4 Pro at 8.74 / 10. Quality bar at 90% of the frontier score: 7.87.
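The bar follows from the frontier score by simple arithmetic. A minimal sketch, assuming the bar is 90% of the frontier score rounded to two decimals (the function and constant names are illustrative, not taken from the page):

```python
# Assumption: quality bar = 90% of the frontier score, rounded to 2 decimals.
FRONTIER_SCORE = 8.74  # DeepSeek V4 Pro, as reported above
BAR_FRACTION = 0.90


def quality_bar(frontier: float, fraction: float = BAR_FRACTION) -> float:
    """Return the minimum score a model needs to count as good enough."""
    return round(frontier * fraction, 2)


print(quality_bar(FRONTIER_SCORE))  # 7.87
```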

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first; the four models whose point estimate falls below the quality bar are listed last. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
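The distinction between a point estimate that clears the bar and a CI low that clears it can be written as a three-way check. This is a sketch under assumptions: the status labels are made up for illustration, the comparison is done on the rounded values shown in the table, and the confidence tier (MEDIUM+) that also affects greying is not modelled.

```python
from typing import Literal

Status = Literal["clears", "point_only", "below"]


def bar_status(score: float, ci_low: float, bar: float) -> Status:
    """Classify a model against the quality bar."""
    if ci_low >= bar:
        return "clears"       # even the lower CI bound is above the bar
    if score >= bar:
        return "point_only"   # point estimate clears, CI low does not
    return "below"            # point estimate itself is under the bar


BAR = 7.87
print(bar_status(8.74, 8.52, BAR))  # clears      (DeepSeek V4 Pro)
print(bar_status(8.18, 7.83, BAR))  # point_only  (GPT-5.4 Mini)
print(bar_status(7.69, 7.21, BAR))  # below       (GPT-5.4 Nano)
```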

Model	Quality score	CI low	Cost / 1k runs	vs best value
GPT-5.4 Mini	8.18 / 10	7.83	$2.58	best value
Tencent Hy3	8.05 / 10	7.85	$2.89	1.1x more expensive
MiniMax M3	8.23 / 10	8.09	$3.08	1.2x more expensive
DeepSeek V4 Flash	8.09 / 10	7.81	$4.22	1.6x more expensive
GPT-5.6 Luna	8.43 / 10	8.15	$4.75	1.8x more expensive
DeepSeek V4 Pro	8.74 / 10	8.52	$5.32	2.1x more expensive
Qwen 3.5 Flash	7.87 / 10	7.59	$5.78	2.2x more expensive
Qwen 3.7 Plus	8.37 / 10	8.30	$6.74	2.6x more expensive
Qwen 3.6 Plus	8.46 / 10	8.22	$6.78	2.6x more expensive
Gemini 3.1 Pro Preview	7.94 / 10	7.69	$7.68	3x more expensive
Claude Haiku 4.5	8.41 / 10	8.14	$7.91	3.1x more expensive
NVIDIA Nemotron-3 Ultra 550B	7.95 / 10	7.49	$10.78	4.2x more expensive
Qwen 3.6 Flash	8.31 / 10	8.20	$10.97	4.3x more expensive
GPT-5.6 Terra	8.28 / 10	7.96	$11.54	4.5x more expensive
Gemini 3.5 Flash	8.15 / 10	8.02	$12.47	4.8x more expensive
Kimi K2.6	8.71 / 10	8.54	$14.82	5.7x more expensive
Claude Sonnet 4.6	8.60 / 10	8.39	$17.35	6.7x more expensive
Claude Sonnet 5	8.28 / 10	8.11	$22.46	8.7x more expensive
Meta Muse Spark 1.1	8.36 / 10	7.95	$23.56	9.1x more expensive
GPT-5.5	8.67 / 10	8.50	$25.15	9.8x more expensive
Grok 4.5	8.66 / 10	8.56	$25.93	10x more expensive
GPT-5.6 Sol	8.41 / 10	8.18	$30.00	12x more expensive
GPT-5.4 Nano	7.69 / 10	7.21	$1.16	55% cheaper
Gemini 3.1 Flash Lite	7.55 / 10	7.30	$1.43	44% cheaper
NVIDIA Nemotron-3 Nano 30B-A3B	6.95 / 10	6.54	$0.97	62% cheaper
NVIDIA Nemotron-3 Super 120B	7.37 / 10	6.92	$3.20	1.2x more expensive
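The "vs best value" column can be reproduced from the table itself. The sketch below assumes the best-value model is simply the cheapest one whose point estimate clears the 7.87 bar; the `Row` type and function names are hypothetical, and only a few rows are copied in.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    cost_per_1k: float  # $ per 1,000 task runs


# A subset of the rows above.
ROWS = [
    Row("NVIDIA Nemotron-3 Nano 30B-A3B", 6.95, 0.97),
    Row("GPT-5.4 Nano", 7.69, 1.16),
    Row("GPT-5.4 Mini", 8.18, 2.58),
    Row("Tencent Hy3", 8.05, 2.89),
    Row("DeepSeek V4 Pro", 8.74, 5.32),
]


def best_value(rows: list[Row], bar: float) -> Row:
    """Cheapest model whose point estimate clears the bar (assumed rule)."""
    eligible = [r for r in rows if r.score >= bar]
    if not eligible:
        raise ValueError("no model clears the bar")
    return min(eligible, key=lambda r: r.cost_per_1k)


def relative_cost(row: Row, reference: Row) -> float:
    """Cost ratio against the reference; below 1.0 means cheaper."""
    return row.cost_per_1k / reference.cost_per_1k


ref = best_value(ROWS, bar=7.87)
print(ref.model)  # GPT-5.4 Mini
for row in ROWS:
    print(f"{row.model}: {relative_cost(row, ref):.2f}x the reference cost")
```

Cheaper models such as GPT-5.4 Nano come out below 1.0 here, which the table expresses as a percentage saving instead of a multiple.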

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
GPT-5.4 Mini ★	OpenAI	8.18 / 10	[7.83, 8.52]	MEDIUM	$2.58	best value	batch
Tencent Hy3	OpenRouter	8.05 / 10	[7.85, 8.25]	RANKED	$2.89	1.1x	batch
MiniMax M3	MiniMax	8.23 / 10	[8.09, 8.37]	RANKED	$3.08	1.2x	batch
DeepSeek V4 Flash	DeepSeek	8.09 / 10	[7.81, 8.38]	HIGH	$4.22	1.6x	batch
GPT-5.6 Luna	OpenAI	8.43 / 10	[8.15, 8.72]	HIGH	$4.75	1.8x	batch
DeepSeek V4 Pro (best)	DeepSeek	8.74 / 10	[8.52, 8.96]	HIGH	$5.32	2.1x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	7.87 / 10	[7.59, 8.15]	HIGH	$5.78	2.2x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.37 / 10	[8.30, 8.43]	RANKED	$6.74	2.6x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.46 / 10	[8.22, 8.70]	HIGH	$6.78	2.6x	batch
Gemini 3.1 Pro Preview	Gemini	7.94 / 10	[7.69, 8.20]	HIGH	$7.68	3x	batch
Claude Haiku 4.5	Anthropic	8.41 / 10	[8.14, 8.68]	HIGH	$7.91	3.1x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	7.95 / 10	[7.49, 8.41]	MEDIUM	$10.78	4.2x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	8.31 / 10	[8.20, 8.41]	RANKED	$10.97	4.3x	batch
GPT-5.6 Terra	OpenAI	8.28 / 10	[7.96, 8.59]	MEDIUM	$11.54	4.5x	batch
Gemini 3.5 Flash	Gemini	8.15 / 10	[8.02, 8.29]	RANKED	$12.47	4.8x	batch
Kimi K2.6	Moonshot AI	8.71 / 10	[8.54, 8.89]	RANKED	$14.82	5.7x	batch
Claude Sonnet 4.6	Anthropic	8.60 / 10	[8.39, 8.81]	HIGH	$17.35	6.7x	batch
Claude Sonnet 5	Anthropic	8.28 / 10	[8.11, 8.45]	RANKED	$22.46	8.7x	batch
Meta Muse Spark 1.1	Meta	8.36 / 10	[7.95, 8.77]	MEDIUM	$23.56	9.1x	batch
GPT-5.5	OpenAI	8.67 / 10	[8.50, 8.85]	RANKED	$25.15	9.8x	batch
Grok 4.5	xAI	8.66 / 10	[8.56, 8.76]	RANKED	$25.93	10x	batch
GPT-5.6 Sol	OpenAI	8.41 / 10	[8.18, 8.65]	HIGH	$30.00	12x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), which serves as the good-enough reference. "16x" means you pay 16× that reference for no quality benefit above the bar.

Typical call shape for this task: 11564 input tokens → 720 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: they show what running the task costs, not a per-token price.
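To make the overpay figure concrete, the multiple can be paired with the extra dollars spent per 1,000 runs. A small sketch using the costs reported above; the function name and the choice to report the dollar difference are illustrative additions, not something the page computes.

```python
REFERENCE_COST = 2.58  # GPT-5.4 Mini (★), $ per 1,000 runs


def overpay(cost_per_1k: float, reference: float = REFERENCE_COST) -> tuple[float, float]:
    """Return (multiple of the reference cost, extra $ per 1,000 runs)."""
    return cost_per_1k / reference, cost_per_1k - reference


multiple, extra = overpay(30.00)  # GPT-5.6 Sol
print(f"{multiple:.1f}x, ${extra:.2f} extra per 1k runs")  # 11.6x, $27.42 extra per 1k runs
```

The table lists this model at 12x, which is consistent with the same ratio rounded to a whole number.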

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

TOPIC_CLUSTER_NAMING_SYSTEM_PROMPT + TOPIC_CLUSTER_NAMING_USER_PROMPT (3847 calls in window)

System prompt

You are a senior analyst specializing in categorizing and naming thematic clusters of research claims.

Your task is to assign a concise, descriptive topic name and brief description to a cluster of semantically similar claims. These claims have already been grouped by embedding similarity and synthesized into a summary — you are naming the resulting topic.

**Naming Guidelines:**
- Choose a name that captures the core theme or insight of the cluster (3-7 words)
- Use clear, professional language suitable for a publication headline
- The name should be specific enough to distinguish from other topics about the same subject
- Avoid generic names like "Market Update" or "Company News" — be specific about WHAT aspect
- Good examples: "Revenue Growth Acceleration", "Regulatory Approval Risks", "Supply Chain Restructuring"

**Description Guidelines:**
- Write 1-2 sentences explaining what the topic covers
- Include the key themes, data points, or developments that define this cluster
- The description should help a reader quickly understand the scope of the topic
- The description MUST be under 500 characters (this is a hard technical limit)

Output your response in the specified JSON format.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

**Subject:** {subject_name}
**Claim Category:** {category}
**Number of Claims:** {claim_count}

--- CLAIMS IN CLUSTER ---
{claims_text}
--- END OF CLAIMS ---

Based on the claims above, provide a concise topic name and brief description that captures the central theme of this cluster.

**JSON Output:** The required JSON output schema is provided in the system prompt.
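The placeholders in the user prompt and the constraints in the system prompt (a 3-7 word name, a description under 500 characters) lend themselves to a small render-and-validate helper. The sketch below rests on several assumptions: the real schema behind `{schema_json_string}` is not shown, so the field names `name` and `description` are guesses; the bullet formatting of claims is invented; and the template is abbreviated to its header and claims block.

```python
import json

# Abbreviated copy of the user prompt template above.
USER_TEMPLATE = (
    "**Subject:** {subject_name}\n"
    "**Claim Category:** {category}\n"
    "**Number of Claims:** {claim_count}\n\n"
    "--- CLAIMS IN CLUSTER ---\n{claims_text}\n--- END OF CLAIMS ---\n"
)


def render_user_prompt(subject: str, category: str, claims: list[str]) -> str:
    """Fill the template; one claim per line is an assumed layout."""
    return USER_TEMPLATE.format(
        subject_name=subject,
        category=category,
        claim_count=len(claims),
        claims_text="\n".join(f"- {c}" for c in claims),
    )


def validate_topic(raw: str) -> list[str]:
    """Return a list of problems; an empty list means these checks pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    name = data.get("name")          # assumed field name
    desc = data.get("description")   # assumed field name
    if not isinstance(name, str) or not 3 <= len(name.split()) <= 7:
        problems.append("name must be 3-7 words")
    if not isinstance(desc, str) or not 0 < len(desc) < 500:
        problems.append("description must be non-empty and under 500 characters")
    return problems


# Made-up sample data, for illustration only.
print(render_user_prompt("Acme Corp", "Financials", ["Revenue rose 12%", "Margins widened"]))
print(validate_topic('{"name": "Revenue Growth Acceleration", "description": "Covers revenue gains."}'))
```

Checks like these only cover the mechanical limits; whether a name is specific enough is what the judge prompt further down is for.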

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (39 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
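A consumer of the judge's reply would typically parse the three fields and map the score onto the bands listed in the system prompt. A minimal sketch, assuming the reply is a JSON object with exactly the keys `score`, `failure_mode`, and `rationale`; the `Verdict` class and the handling of scores that fall between bands (for example 8.95, treated here as "Good") are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    score: float
    failure_mode: Optional[str]
    rationale: str


# Lower bound of each band from the scoring scale, highest first.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def band(score: float) -> str:
    """Map a 0.0-10.0 score to its band label."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise ValueError(f"score below scale: {score}")


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply and range-check the score."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score, data.get("failure_mode"), str(data["rationale"]))


verdict = parse_verdict('{"score": 8.5, "failure_mode": null, "rationale": "Specific and concise."}')
print(verdict.score, band(verdict.score), verdict.failure_mode)  # 8.5 Good None
```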

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (2 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
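The low call count suggests the repair prompt is a fallback used only when normal parsing fails. One way such a fallback could be wired is sketched below; this is an assumption about the flow, not the pipeline's actual code. The `repair` argument is a hypothetical callable standing in for a model call that fills `{schema_json}` and `{raw_text}`, and the stub used in the demo just closes a truncated object by hand.

```python
import json
from typing import Callable


def parse_or_repair(raw: str, schema: dict, repair: Callable[[str, str], str]) -> dict:
    """Parse model output as JSON, asking `repair` for help only on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Hypothetical repair step: receives (schema_json, raw_text).
        data = json.loads(repair(json.dumps(schema), raw))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def fake_repair(schema_json: str, raw_text: str) -> str:
    """Stand-in for the model call; closes a string and object cut off mid-value."""
    return raw_text + '"}'


truncated = '{"name": "Supply Chain Restructuring", "description": "Covers'
print(parse_or_repair(truncated, {}, fake_repair))
```

If the repaired text is still invalid, the second `json.loads` raises, which leaves the caller free to retry or discard the run.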