Best LLMs for Metadata Paragraph Rewriting

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 409→552 tokens

Models

Frontier on this task: Claude Sonnet 5 at 9.51 / 10. Quality bar at 90%: 8.56.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
Claude Sonnet 5	9.51 / 10	9.40	$1.71	best value
Gemini 3.5 Flash	9.10 / 10	8.90	$5.93	3.5x more expensive
GPT-5.4 Mini	6.68 / 10	6.30	$0.41	76% cheaper
GPT-5.5	7.61 / 10	7.27	$4.92	2.9x more expensive
Claude Haiku 4.5	5.61 / 10	5.11	$1.03	40% cheaper
Gemini 3.1 Flash Lite	7.58 / 10	7.35	$0.27	84% cheaper
DeepSeek V4 Pro	8.18 / 10	7.88	$0.45	74% cheaper
Qwen 3.6 Plus	8.50 / 10	8.27	$1.41	18% cheaper
Qwen 3.5 Flash	8.07 / 10	7.83	$0.74	57% cheaper
GPT-5.4 Nano	7.15 / 10	6.80	$0.22	87% cheaper
Gemini 3.1 Pro Preview	7.45 / 10	7.19	$1.75	1.02x more expensive
NVIDIA Nemotron-3 Nano 30B-A3B	6.55 / 10	6.17	$0.13	92% cheaper
NVIDIA Nemotron-3 Ultra 550B	8.55 / 10	8.25	$2.23	1.3x more expensive
GPT-5.6 Sol	6.09 / 10	5.81	$2.96	1.7x more expensive
NVIDIA Nemotron-3 Super 120B	7.35 / 10	6.85	$0.34	80% cheaper
Qwen 3.7 Plus	7.18 / 10	7.01	$4.91	2.9x more expensive
GPT-5.6 Terra	6.40 / 10	6.16	$1.11	35% cheaper
Tencent Hy3	6.70 / 10	6.47	$0.52	69% cheaper
GPT-5.6 Luna	6.34 / 10	6.09	$0.60	65% cheaper
Meta Muse Spark 1.1	5.80 / 10	5.54	$9.62	5.6x more expensive
Qwen 3.6 Flash	6.96 / 10	6.82	$5.46	3.2x more expensive
Grok 4.5	6.11 / 10	5.92	$10.72	6.3x more expensive
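
The "vs best value" labels can be reproduced from the cost column. The sketch below assumes each label is the ratio of a model's cost to the best-value model's cost ($1.71), shown as a one-decimal multiple when more expensive and as a whole-number percentage when cheaper. The function name `vs_best_value` and the rounding rule are assumptions inferred from the table, not the site's actual implementation, and the published labels appear to be computed from unrounded costs, so a label can differ by one unit from what the rounded dollar figures give.

```python
def vs_best_value(cost: float, best_cost: float) -> str:
    """Label a cost relative to the best-value model (assumed rounding rule)."""
    if cost == best_cost:
        return "best value"
    ratio = cost / best_cost
    if ratio > 1:
        # More expensive: express as a multiple of the reference cost.
        return f"{ratio:.1f}x more expensive"
    # Cheaper: express the saving as a percentage of the reference cost.
    return f"{(1 - ratio) * 100:.0f}% cheaper"


BEST_COST = 1.71  # Claude Sonnet 5, $ per 1k runs

gemini_label = vs_best_value(5.93, BEST_COST)  # Gemini 3.5 Flash
mini_label = vs_best_value(0.41, BEST_COST)    # GPT-5.4 Mini
print(gemini_label)
print(mini_label)
```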

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
Claude Sonnet 5 (Anthropic) ★ best	9.51 / 10 CI [9.40, 9.61]	RANKED	$1.71	best value	batch
Gemini 3.5 Flash (Gemini)	9.10 / 10 CI [8.90, 9.31]	HIGH	$5.93	3.5x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the best-value good-enough option. "16x" means you pay 16× that reference for no quality benefit above the bar.

Typical call shape for this task: 409 input tokens → 552 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: they measure what running the task costs, not a per-token price.
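
As a worked example of the quality bar and Overpay arithmetic, the sketch below takes four rows from the Models table, sets the bar at 90% of the frontier score, keeps the models whose CI low clears it, and divides each cost by the cheapest qualifier's cost. Treating "clears the bar" as `ci_low >= bar` and picking the reference model by lowest cost are assumptions read off the page, and the `Row` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    name: str
    score: float   # quality score out of 10
    ci_low: float  # lower bound of the confidence interval
    cost: float    # $ per 1k runs


rows = [
    Row("Claude Sonnet 5", 9.51, 9.40, 1.71),
    Row("Gemini 3.5 Flash", 9.10, 8.90, 5.93),
    Row("Qwen 3.6 Plus", 8.50, 8.27, 1.41),
    Row("NVIDIA Nemotron-3 Ultra 550B", 8.55, 8.25, 2.23),
]

frontier = max(r.score for r in rows)
bar = round(0.90 * frontier, 2)  # quality bar at 90% of the frontier

# Assumption: a model qualifies only if its CI low clears the bar.
qualified = [r for r in rows if r.ci_low >= bar]
best = min(qualified, key=lambda r: r.cost)

# Overpay: each qualifier's cost as a multiple of the reference model's cost.
overpay = {r.name: round(r.cost / best.cost, 1) for r in qualified}
print(bar, best.name, overpay)
```

Under these assumptions only the two models listed in the cost breakdown qualify; the cheaper Qwen 3.6 Plus falls just short of the bar, so it does not become the reference.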

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

METADATA_PARAGRAPH_IMPROVEMENT_SYSTEM_PROMPT + METADATA_PARAGRAPH_IMPROVEMENT_USER_PROMPT (1647 calls in window)

System prompt

You are an expert editor specializing in creating polished, professional metadata paragraphs for research reports.

Your task is to take a draft metadata paragraph that describes the research period and sources, and improve its wording to make it:
- More professional and polished
- Clear and concise
- Reader-friendly
- Accurate to the data provided

IMPORTANT GUIDELINES:
1. Preserve all factual information (dates, numbers, source types)
2. Do NOT add information that wasn't in the original
3. Do NOT change the meaning or facts
4. Keep the paragraph approximately the same length
5. Maintain a professional, informative tone
6. Start with a clear statement about the research period
7. Include source count and types naturally in the flow

The improved paragraph should be suitable for inclusion at the beginning of a professional research report or analysis document.

User prompt

Please improve the wording of this metadata paragraph while preserving all factual information:

{draft_paragraph}

Key facts to preserve:
- Date range: {date_from} to {date_to}
- Number of sources: {source_count}
- Content types: {content_type_summary}

Return the improved paragraph that maintains these facts while improving clarity and professionalism.
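
A minimal sketch of how the user prompt above might be filled in, assuming the placeholders are substituted with Python's `str.format`. The helper name `build_user_prompt` and all sample values are hypothetical, for illustration only; the production code may assemble the prompt differently.

```python
USER_PROMPT = """Please improve the wording of this metadata paragraph while preserving all factual information:

{draft_paragraph}

Key facts to preserve:
- Date range: {date_from} to {date_to}
- Number of sources: {source_count}
- Content types: {content_type_summary}

Return the improved paragraph that maintains these facts while improving clarity and professionalism."""


def build_user_prompt(draft_paragraph: str, date_from: str, date_to: str,
                      source_count: int, content_type_summary: str) -> str:
    """Substitute the five placeholders; raises KeyError if one is missing."""
    return USER_PROMPT.format(
        draft_paragraph=draft_paragraph,
        date_from=date_from,
        date_to=date_to,
        source_count=source_count,
        content_type_summary=content_type_summary,
    )


# Hypothetical sample values.
prompt = build_user_prompt(
    draft_paragraph="This report covers 42 sources from Jan to Mar 2024.",
    date_from="2024-01-01",
    date_to="2024-03-31",
    source_count=42,
    content_type_summary="articles, reports",
)
print(prompt)
```

Passing the key facts separately from the draft gives the model an explicit checklist, which matches the system prompt's rule to preserve dates, numbers, and source types.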

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (5 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
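
A minimal sketch of the calling side of the repair prompt, assuming the template is filled with `str.format` and the reply is checked with the standard-library `json` module. The helper names `build_repair_prompt` and `parse_repaired` and the sample schema are hypothetical. The check only confirms that the reply is a single JSON object; it does not validate the object against the schema, which would need a schema validator.

```python
import json

JSON_REPAIR_USER = """JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object."""


def build_repair_prompt(schema: dict, raw_text: str) -> str:
    """Fill the user prompt; braces inside the inserted values are left as-is."""
    return JSON_REPAIR_USER.format(
        schema_json=json.dumps(schema, indent=2),
        raw_text=raw_text,
    )


def parse_repaired(reply: str) -> dict:
    """Accept the reply only if it is exactly one JSON object."""
    obj = json.loads(reply)  # raises json.JSONDecodeError on prose or code fences
    if not isinstance(obj, dict):
        raise ValueError("expected a single JSON object")
    return obj


# Hypothetical schema and truncated model output.
schema = {"type": "object", "properties": {"title": {"type": ["string", "null"]}}}
raw = '{"title": "Q1 report",'
prompt = build_repair_prompt(schema, raw)
repaired = parse_repaired('{"title": "Q1 report"}')
print(repaired)
```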