Best LLMs for Publication Title Generation

Category: Content Summarization & Synthesis · Rail: absolute · Typical I/O: 2924→2387 tokens

Models

Frontier on this task: MiniMax M3 at 8.98 / 10. Quality bar at 90% of the frontier score: 8.08.
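
As a quick check on the two numbers above, the bar appears to be 90% of the frontier score. The sketch below assumes that relationship (the page states both values but not the formula); `quality_bar` is an illustrative name.

```python
FRONTIER_SCORE = 8.98  # MiniMax M3, as stated above
BAR_FRACTION = 0.90    # "Quality bar at 90%"


def quality_bar(frontier: float, fraction: float) -> float:
    """Return the quality bar as a fraction of the frontier score (assumed formula)."""
    return frontier * fraction


bar = quality_bar(FRONTIER_SCORE, BAR_FRACTION)
print(f"{bar:.2f}")  # 8.08
```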

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	8.24 / 10	8.04	$0.71	best value
MiniMax M3	8.98 / 10	8.90	$1.62	2.3x more expensive
Tencent Hy3	8.34 / 10	8.15	$2.49	3.5x more expensive
DeepSeek V4 Pro	8.33 / 10	8.18	$3.50	4.9x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.18 / 10	7.80	$4.53	6.4x more expensive
Claude Haiku 4.5	8.16 / 10	8.00	$5.39	7.6x more expensive
Qwen 3.7 Plus	8.64 / 10	8.48	$7.88	11x more expensive
Gemini 3.1 Pro Preview	8.43 / 10	8.32	$10.31	15x more expensive
Qwen 3.6 Plus	8.23 / 10	8.09	$11.33	16x more expensive
Gemini 3.5 Flash	8.80 / 10	8.70	$11.40	16x more expensive
Qwen 3.6 Flash	8.29 / 10	8.17	$12.17	17x more expensive
Claude Sonnet 4.6	8.25 / 10	8.09	$17.15	24x more expensive
Meta Muse Spark 1.1	8.75 / 10	8.53	$19.11	27x more expensive
Kimi K2.6	8.81 / 10	8.72	$23.03	33x more expensive
Claude Opus 4.8	8.93 / 10	8.84	$24.39	34x more expensive
Grok 4.5	8.50 / 10	8.43	$26.90	38x more expensive
GPT-5.5	8.11 / 10	7.93	$33.39	47x more expensive
Claude Sonnet 5	8.81 / 10	8.69	$49.48	70x more expensive
GPT-5.4 Mini	7.73 / 10	7.48	$2.94	4.2x more expensive
GPT-5.4 Nano	7.90 / 10	7.65	$1.71	2.4x more expensive
NVIDIA Nemotron-3 Nano 30B-A3B	7.47 / 10	7.16	$0.97	1.4x more expensive
NVIDIA Nemotron-3 Super 120B	7.91 / 10	7.65	$2.22	3.1x more expensive
Qwen 3.5 Flash	7.63 / 10	7.48	$1.39	2x more expensive
Gemini 3.1 Flash Lite	7.75 / 10	7.58	$1.44	2x more expensive
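
The "vs best value" column can be approximately reproduced from the table itself. The sketch below assumes the reference is the cheapest model whose point estimate clears the 8.08 bar and that the multiple is a plain cost ratio. `Row`, `best_value`, and `cost_multiple` are illustrative names, and only a few rows are copied in. Because the published dollar figures are rounded, a recomputed multiple can differ from the table in the last digit.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    score: float   # quality point estimate, out of 10
    ci_low: float  # lower bound of the confidence interval
    cost: float    # $ per 1k runs


QUALITY_BAR = 8.08

# A subset of the table above, copied as published.
ROWS = [
    Row("DeepSeek V4 Flash", 8.24, 8.04, 0.71),
    Row("MiniMax M3", 8.98, 8.90, 1.62),
    Row("Claude Opus 4.8", 8.93, 8.84, 24.39),
    Row("GPT-5.4 Nano", 7.90, 7.65, 1.71),
    Row("NVIDIA Nemotron-3 Nano 30B-A3B", 7.47, 7.16, 0.97),
]


def best_value(rows: List[Row], bar: float) -> Row:
    """Cheapest row whose point estimate clears the bar (assumed selection rule)."""
    eligible = [r for r in rows if r.score >= bar]
    return min(eligible, key=lambda r: r.cost)


def cost_multiple(row: Row, reference: Row) -> float:
    """How many times more expensive `row` is than the reference row."""
    return row.cost / reference.cost


ref = best_value(ROWS, QUALITY_BAR)
for r in ROWS:
    print(f"{r.model}: {cost_multiple(r, ref):.1f}x")
```

Note that the selection uses the point estimate only; a stricter variant could require `ci_low >= bar` instead.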

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★	DeepSeek	8.24 / 10	[8.04, 8.44]	RANKED	$0.71	best value	batch
MiniMax M3 (best)	MiniMax	8.98 / 10	[8.90, 9.05]	RANKED	$1.62	2.3x	batch
Tencent Hy3	OpenRouter	8.34 / 10	[8.15, 8.54]	RANKED	$2.49	3.5x	batch
DeepSeek V4 Pro	DeepSeek	8.33 / 10	[8.18, 8.49]	RANKED	$3.50	4.9x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.18 / 10	[7.80, 8.55]	MEDIUM	$4.53	6.4x	batch
Claude Haiku 4.5	Anthropic	8.16 / 10	[8.00, 8.33]	RANKED	$5.39	7.6x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.64 / 10	[8.48, 8.80]	RANKED	$7.88	11x	batch
Gemini 3.1 Pro Preview	Gemini	8.43 / 10	[8.32, 8.53]	RANKED	$10.31	15x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.23 / 10	[8.09, 8.38]	RANKED	$11.33	16x	batch
Gemini 3.5 Flash	Gemini	8.80 / 10	[8.70, 8.91]	RANKED	$11.40	16x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	8.29 / 10	[8.17, 8.42]	RANKED	$12.17	17x	batch
Claude Sonnet 4.6	Anthropic	8.25 / 10	[8.09, 8.41]	RANKED	$17.15	24x	batch
Meta Muse Spark 1.1	Meta	8.75 / 10	[8.53, 8.97]	HIGH	$19.11	27x	batch
Kimi K2.6	Moonshot AI	8.81 / 10	[8.72, 8.90]	RANKED	$23.03	33x	batch
Claude Opus 4.8	Anthropic	8.93 / 10	[8.84, 9.02]	RANKED	$24.39	34x	batch
Grok 4.5	xAI	8.50 / 10	[8.43, 8.57]	RANKED	$26.90	38x	batch
GPT-5.5	OpenAI	8.11 / 10	[7.93, 8.29]	RANKED	$33.39	47x	batch
Claude Sonnet 5	Anthropic	8.81 / 10	[8.69, 8.93]	RANKED	$49.48	70x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), the best-value good-enough option. "16x" means you pay 16× that reference cost for no quality benefit above the bar.

Typical call shape for this task: 2924 input tokens → 2387 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs. It reflects each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
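
To make the cost definition concrete, here is a minimal sketch of how an all-in figure of this kind could be assembled. Everything in it is an assumption for illustration: the `Usage` and `Rates` fields, the treatment of reasoning tokens as output-priced, the modelling of billed failures as a simple surcharge, and every numeric rate. None of it is the page's actual accounting.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Average measured tokens per task run (hypothetical fields)."""
    input_tokens: float
    output_tokens: float
    reasoning_tokens: float
    cache_read_tokens: float
    cache_write_tokens: float


@dataclass
class Rates:
    """List prices in $ per 1M tokens (placeholder values, not real prices)."""
    input_rate: float
    output_rate: float
    cache_read_rate: float
    cache_write_rate: float


def cost_per_1k_runs(usage: Usage, rates: Rates,
                     billed_failure_rate: float, overhead: float) -> float:
    """All-in $ per 1,000 runs under the assumptions stated above."""
    per_run = (
        usage.input_tokens * rates.input_rate
        # Assumption: reasoning tokens are billed at the output rate.
        + (usage.output_tokens + usage.reasoning_tokens) * rates.output_rate
        + usage.cache_read_tokens * rates.cache_read_rate
        + usage.cache_write_tokens * rates.cache_write_rate
    ) / 1_000_000
    # Assumption: each billed failure costs about as much as one normal run.
    with_failures = per_run * (1.0 + billed_failure_rate)
    return with_failures * overhead * 1000


# Typical call shape from this page, priced with placeholder rates.
typical = Usage(2924, 2387, 0, 0, 0)
placeholder = Rates(input_rate=0.10, output_rate=0.30,
                    cache_read_rate=0.01, cache_write_rate=0.12)
print(round(cost_per_1k_runs(typical, placeholder, 0.02, 1.05), 2))
```

Two models with the same list price can land on very different figures here once reasoning tokens and failures are included, which is why the table is not comparable to per-token pricing.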

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

PUBLICATION_TITLE_GENERATION_SYSTEM_PROMPT + PUBLICATION_TITLE_GENERATION_USER_PROMPT (3711 calls in window)

System prompt

You are an expert Editor-in-Chief specializing in publication headlines.
Your task is to generate compelling, professional titles and subtitles for research publications.

{audience_context}

For each category provided, generate one title/subtitle pair that:
- Is accurate and reflects the content
- Is engaging and click-worthy without being clickbait
- Follows the category's specific angle/style
- Title: 5-12 words
- Subtitle: 10-20 words providing additional context

After generating all variants, act as the Editor-in-Chief:
- Select the single best variant as your "Editor's Choice"
- Explain in 2-3 sentences WHY this category and title best fits the article's core value and data

For the Editor's Choice, also generate two SEO fields:
- meta_title: The headline as it will appear in Google search results and the
browser tab. Target ≤60 characters — beyond that, Google truncates and the
meaning is lost. It must stand alone (the reader sees no subtitle, no image,
no body). Front-load the subject and the outcome; cut filler words like
"A Look At", "Insights On", "Exploring". The chosen `title` field can be
longer and more expressive — `meta_title` is the search-snippet version.
- meta_description: The single-sentence summary shown under the title in
Google search results. Target ≤155 characters. Present tense, summarises
the takeaway, no clickbait, no trailing ellipsis.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```
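
The length targets in this prompt are easy to check mechanically after generation. A minimal sketch, assuming whitespace-delimited word counts and plain character counts; the function names are illustrative, and because the schema is not shown here the checks take plain strings rather than a parsed response.

```python
from typing import List


def word_count(text: str) -> int:
    """Count whitespace-delimited words."""
    return len(text.split())


def check_variant(title: str, subtitle: str) -> List[str]:
    """Return the word-count targets that a title/subtitle pair misses."""
    problems = []
    if not 5 <= word_count(title) <= 12:
        problems.append("title should be 5-12 words")
    if not 10 <= word_count(subtitle) <= 20:
        problems.append("subtitle should be 10-20 words")
    return problems


def check_seo(meta_title: str, meta_description: str) -> List[str]:
    """Return the SEO targets that the Editor's Choice fields miss."""
    problems = []
    if len(meta_title) > 60:
        problems.append("meta_title exceeds 60 characters")
    if len(meta_description) > 155:
        problems.append("meta_description exceeds 155 characters")
    if meta_description.rstrip().endswith(("...", "\u2026")):
        problems.append("meta_description ends with an ellipsis")
    return problems


print(check_variant(
    "Small Models Match Frontier Quality on Titles",
    "A benchmark of 24 models shows where cost and quality diverge most",
))
print(check_seo("x" * 61, "Summarises the takeaway..."))
```

Checks like these only cover the countable constraints; accuracy and tone still need a judge or a human reviewer.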

User prompt

Generate title/subtitle variants for the following analysis about "{title_context}":

CONTENT:
{content}

Generate one title/subtitle pair for each of the following categories:

{categories_text}

Then select the best variant as your Editor's Choice with a rationale explaining why it's the best fit.

The required JSON output schema is provided in the system prompt.
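
The placeholders in this user prompt can be filled with ordinary string formatting. The sketch below is one possible way to do it; the bullet format used for `{categories_text}` and the `render_user_prompt` helper are assumptions, since the page does not show how the categories are serialised.

```python
from typing import List, Tuple

USER_PROMPT = """Generate title/subtitle variants for the following analysis about "{title_context}":

CONTENT:
{content}

Generate one title/subtitle pair for each of the following categories:

{categories_text}

Then select the best variant as your Editor's Choice with a rationale explaining why it's the best fit.

The required JSON output schema is provided in the system prompt."""


def render_user_prompt(title_context: str, content: str,
                       categories: List[Tuple[str, str]]) -> str:
    """Fill the template; categories are (name, angle) pairs (hypothetical shape)."""
    categories_text = "\n".join(f"- {name}: {angle}" for name, angle in categories)
    return USER_PROMPT.format(
        title_context=title_context,
        content=content,
        categories_text=categories_text,
    )


prompt = render_user_prompt(
    "LLM costs",
    "Body text with literal {braces} is passed through unchanged.",
    [("Data-driven", "lead with the key number")],
)
print(prompt)
```

Substituted values are not re-parsed by `str.format`, so braces inside the article content are safe.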

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (39 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
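
A consumer of the judge's response would typically validate it before aggregating scores. The sketch below parses the three fields named in the system prompt and maps the score to the bands listed there. Treating each band's lower bound as an inclusive threshold (so 8.95 falls in "Good") is an assumption, as are the names `band_for` and `parse_judgement`.

```python
import json
from typing import Any, Dict

# (inclusive lower bound, label), highest band first.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def band_for(score: float) -> str:
    """Map a 0.0-10.0 score to its band label."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


def parse_judgement(raw: str) -> Dict[str, Any]:
    """Parse the judge's JSON and attach the band label."""
    data = json.loads(raw)
    score = float(data["score"])
    return {
        "score": score,
        "band": band_for(score),
        "failure_mode": data.get("failure_mode"),  # None when the judge returns null
        "rationale": data["rationale"],
    }


print(parse_judgement(
    '{"score": 8.5, "failure_mode": null, "rationale": "Meets most requirements."}'
))
```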

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (5 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
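
The low call count for this family suggests it is used as a fallback rather than on every response. One way such a fallback might be wired, as a sketch only: try a normal parse first and call the repair prompt only when that fails. The `repair` callable is a hypothetical stand-in for whatever sends the JSON_REPAIR prompts to a model; no schema validation is performed here.

```python
import json
from typing import Any, Callable, Dict


def parse_or_repair(raw_text: str, schema_json: str,
                    repair: Callable[[str, str], str]) -> Dict[str, Any]:
    """Parse model output as JSON, falling back to a repair call on failure.

    `repair(schema_json, raw_text)` is assumed to return the repaired JSON text.
    """
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        # Only pay for a repair call when the cheap local parse fails.
        parsed = json.loads(repair(schema_json, raw_text))
    if not isinstance(parsed, dict):
        raise ValueError("expected a single JSON object")
    return parsed


def fake_repair(schema_json: str, raw_text: str) -> str:
    """Stand-in for the model call, used here only to make the sketch runnable."""
    return '{"title": "Recovered title"}'


print(parse_or_repair('{"title": "Already valid"}', "{}", fake_repair))
print(parse_or_repair('{"title": "Truncated', "{}", fake_repair))
```

If the repaired text also fails to parse, the second `json.loads` raises, which leaves retry or failure accounting to the caller.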