Best LLMs for Content Summarization

Category: Content Summarization & Synthesis · Rail: absolute · Typical I/O: 3513→2364 tokens

Models

Frontier on this task: GPT-5.4 Nano at 9.48 / 10. Quality bar at 90%: 8.53.

point-estimate floor (CI low) · upper CI (less certain) · Bars sorted by blended cost; best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
GPT-5.4 Nano	9.48 / 10	9.36	$1.36	best value
Tencent Hy3	8.73 / 10	8.45	$2.59	1.9x more expensive
GPT-5.6 Luna	9.11 / 10	8.86	$10.43	7.7x more expensive
Kimi K2.6	8.55 / 10	8.39	$15.07	11x more expensive
GPT-5.6 Sol	9.00 / 10	8.56	$41.46	31x more expensive
Claude Haiku 4.5	7.28 / 10	7.10	$7.36	5.4x more expensive
MiniMax M3	8.42 / 10	8.26	$1.27	7% cheaper
NVIDIA Nemotron-3 Ultra 550B	7.83 / 10	7.47	$10.62	7.8x more expensive
Qwen 3.6 Plus	7.03 / 10	6.59	$20.99	15x more expensive
DeepSeek V4 Pro	7.19 / 10	6.97	$5.93	4.4x more expensive
Qwen 3.5 Flash	7.63 / 10	7.39	$3.53	2.6x more expensive
Meta Muse Spark 1.1	8.43 / 10	8.09	$22.06	16x more expensive
Gemini 3.1 Pro Preview	6.30 / 10	6.10	$1.96	1.4x more expensive
GPT-5.5	6.87 / 10	6.57	$28.83	21x more expensive
Gemini 3.5 Flash	6.23 / 10	5.88	$28.85	21x more expensive
Claude Sonnet 5	7.95 / 10	7.82	$15.85	12x more expensive
Claude Opus 4.8	8.24 / 10	8.05	$32.62	24x more expensive
Claude Sonnet 4.6	7.50 / 10	7.34	$14.03	10x more expensive
Grok 4.5	8.30 / 10	8.17	$20.84	15x more expensive
GPT-5.4 Mini	7.08 / 10	6.78	$2.35	1.7x more expensive
Qwen 3.7 Plus	7.94 / 10	7.78	$6.86	5x more expensive
Qwen 3.6 Flash	6.97 / 10	6.52	$11.03	8.1x more expensive
DeepSeek V4 Flash	7.07 / 10	6.85	$1.44	1.1x more expensive
Gemini 3.1 Flash Lite	6.10 / 10	5.89	$0.35	74% cheaper
GPT-5.6 Terra	8.38 / 10	8.09	$13.31	9.8x more expensive

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
GPT-5.4 Nano ★ best OpenAI	9.48 / 10 CI [9.36, 9.60]	RANKED	$1.36	best value	batch
Tencent Hy3 OpenRouter	8.73 / 10 CI [8.45, 9.00]	HIGH	$2.59	1.9x	batch
GPT-5.6 Luna OpenAI	9.11 / 10 CI [8.86, 9.36]	HIGH	$10.43	7.7x	batch
Kimi K2.6 Moonshot AI	8.55 / 10 CI [8.39, 8.71]	RANKED	$15.07	11x	batch
GPT-5.6 Sol OpenAI	9.00 / 10 CI [8.56, 9.44]	MEDIUM	$41.46	31x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★) — the best-value good-enough option. "16x" means you overpay 16× — 16× that reference for no quality benefit above the bar. Typical call shape for this task: 3513 input tokens → 2364 output tokens, EMA-tracked from production traffic. Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task — output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures — priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. Not comparable to providers' advertised $/1M list rates — this is what running the task costs, not a per-token price.

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

CONTENT_SUMMARIZATION_SYSTEM_PROMPT + CONTENT_SUMMARIZATION_USER_PROMPT (83653 calls in window)

System prompt

You are an expert content analyst. Your task is to create a comprehensive summary of the provided content, extracting ALL information that is relevant to the report structure requirements provided in the user message.

## Instructions:
1. Extract ALL facts, data points, quotes, and insights relevant to ANY of the report chapters listed in the user message
2. Do NOT impose arbitrary length limits - capture everything relevant
3. Focus on substance over style - preserve key details, statistics, and specific claims
4. Remove irrelevant tangents, formatting artifacts, and social media noise
5. Write in clear, objective prose
6. If the content is highly relevant, your summary may be very long - that's expected

Your summary will be used for topic clustering and synthesis, so completeness is more important than brevity.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Please summarize the following content, extracting all information relevant to the report structure requirements.

## Report Structure Requirements:
{report_structure}

## Content Metadata:
- Title: {content_title}
- Source: {content_url}
- Type: {content_type}

## Content:
{content_text}

## Required Output Format:
The required JSON output schema is provided in the system prompt.

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (367 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (5 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.