Best LLMs for Claim Refinement

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 2982→1508 tokens

Models

Frontier on this task: Tencent Hy3 at 8.47 / 10. Quality bar at 90%: 7.63.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Models whose point estimate clears the bar are sorted by blended cost, best-value model first; models below the bar follow. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
Gemini 3.1 Flash Lite	8.47 / 10	8.32	$0.56	best value
MiniMax M3	7.67 / 10	7.32	$1.39	2.5x more expensive
Tencent Hy3	8.47 / 10	8.24	$1.80	3.2x more expensive
Qwen 3.5 Flash	7.80 / 10	7.57	$2.88	5.1x more expensive
GPT-5.6 Luna	7.90 / 10	7.40	$2.90	5.2x more expensive
Qwen 3.7 Plus	8.30 / 10	8.12	$4.42	7.9x more expensive
Gemini 3.5 Flash	8.31 / 10	8.03	$7.46	13x more expensive
Qwen 3.6 Flash	8.10 / 10	7.73	$7.74	14x more expensive
Claude Sonnet 5	8.35 / 10	8.05	$8.26	15x more expensive
Qwen 3.6 Plus	8.14 / 10	7.91	$9.11	16x more expensive
Claude Opus 4.8	7.99 / 10	7.69	$19.53	35x more expensive
Kimi K2.6	7.18 / 10	6.84	$17.58	31x more expensive
DeepSeek V4 Flash	7.50 / 10	7.24	$0.66	1.2x more expensive
GPT-5.5	7.47 / 10	7.15	$17.12	31x more expensive
Claude Haiku 4.5	6.94 / 10	6.63	$2.44	4.4x more expensive
GPT-5.4 Nano	7.12 / 10	6.83	$0.39	30% cheaper
DeepSeek V4 Pro	7.14 / 10	6.80	$2.38	4.2x more expensive
Gemini 3.1 Pro Preview	7.19 / 10	6.91	$1.67	3x more expensive
Claude Sonnet 4.6	7.35 / 10	7.03	$7.30	13x more expensive
GPT-5.4 Mini	7.16 / 10	6.88	$0.88	1.6x more expensive
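
The bar logic in the legend can be sketched in a few lines of Python. This is an illustration only, not the site's implementation: the bar value and the three sample rows are copied from the table above, while the function name `classify` and the label strings are assumptions made for this example. The confidence tier (MEDIUM+) is not modelled.

```python
QUALITY_BAR = 7.63  # published quality bar for this task


def classify(score: float, ci_low: float, bar: float = QUALITY_BAR) -> str:
    """Label a row by how its point estimate and CI low compare to the bar."""
    if score < bar:
        return "below bar"
    if ci_low < bar:
        # Point estimate clears the bar, but the lower CI bound does not.
        return "greyed"
    return "clears bar"


# (quality score, CI low) pairs taken from the table above.
rows = {
    "Gemini 3.1 Flash Lite": (8.47, 8.32),
    "MiniMax M3": (7.67, 7.32),
    "Kimi K2.6": (7.18, 6.84),
}
labels = {name: classify(score, ci_low) for name, (score, ci_low) in rows.items()}
```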

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
Gemini 3.1 Flash Lite ★	Gemini	8.47 / 10	[8.32, 8.63]	RANKED	$0.56	best value	batch
MiniMax M3	MiniMax	7.67 / 10	[7.32, 8.02]	MEDIUM	$1.39	2.5x	batch
Tencent Hy3 (best)	OpenRouter	8.47 / 10	[8.24, 8.71]	HIGH	$1.80	3.2x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	7.80 / 10	[7.57, 8.02]	HIGH	$2.88	5.1x	batch
GPT-5.6 Luna	OpenAI	7.90 / 10	[7.40, 8.40]	MEDIUM	$2.90	5.2x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.30 / 10	[8.12, 8.49]	RANKED	$4.42	7.9x	batch
Gemini 3.5 Flash	Gemini	8.31 / 10	[8.03, 8.59]	HIGH	$7.46	13x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	8.10 / 10	[7.73, 8.47]	MEDIUM	$7.74	14x	batch
Claude Sonnet 5	Anthropic	8.35 / 10	[8.05, 8.65]	HIGH	$8.26	15x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.14 / 10	[7.91, 8.37]	HIGH	$9.11	16x	batch
Claude Opus 4.8	Anthropic	7.99 / 10	[7.69, 8.30]	MEDIUM	$19.53	35x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), i.e. the best-value good-enough option. "16x" means you pay 16 times that reference cost for no quality benefit above the bar.

Typical call shape for this task: 2982 input tokens → 1508 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
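
As a worked example of the Overpay column, the ratio is simply a model's cost per 1k runs divided by the cost of the ★ reference model. The Python sketch below reproduces several labels from the tables above; the function name `overpay_label` and the two-significant-digit rounding rule are assumptions inferred from the displayed values, not a documented formula.

```python
def overpay_label(cost: float, reference_cost: float) -> str:
    """Format a cost relative to the best-value reference model."""
    if cost == reference_cost:
        return "best value"
    ratio = cost / reference_cost
    if ratio < 1:
        # Cheaper than the reference: report the saving as a percentage.
        return f"{round((1 - ratio) * 100)}% cheaper"
    if ratio >= 100:
        return f"{ratio:.0f}x"
    # Two significant digits, e.g. 2.48 -> "2.5x", 16.27 -> "16x".
    return f"{ratio:.2g}x"


REFERENCE = 0.56  # Gemini 3.1 Flash Lite, $ per 1k runs

examples = {
    "MiniMax M3": overpay_label(1.39, REFERENCE),
    "Qwen 3.6 Plus": overpay_label(9.11, REFERENCE),
    "GPT-5.4 Nano": overpay_label(0.39, REFERENCE),
}
```

With the costs from the table, `examples` maps MiniMax M3 to "2.5x", Qwen 3.6 Plus to "16x", and GPT-5.4 Nano to "30% cheaper".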

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

CLAIM_REFINEMENT_SYSTEM_PROMPT + CLAIM_REFINEMENT_USER_PROMPT (128181 calls in window)

System prompt

You are a quality reviewer for extracted factual claims. You receive a list of claims that were extracted from a research summary, along with the original summary text.

Your job is to review each claim and either REFINE it or DROP it.

## REFINE a claim when:
- It is a valid factual claim but lacks context — add the missing subject, entity name, or qualifier from the original summary so the claim is self-contained
- It uses abbreviations or ticker symbols without full names — expand them (e.g. "MU" → "Micron Technology (MU)")
- It references "the post", "the author", "the article" — rewrite as a standalone fact
- It has ambiguous references ("the fund", "this strategy", "the product") — resolve from the summary

## DROP a claim when:
- It reports absence of information ("no ESG data is provided", "no macro drivers are stated")
- It describes methodology, tools, platforms, simulation setup, or backtest configuration rather than the financial subject
- It is promotional or marketing content (trials, subscriptions, community invitations)
- It is a mere mention, association, or classification without substantive content
- It is a meta-description of the source (its topic, audience, format, tone, or purpose)
- It is a user comment, opinion, or community reaction rather than a factual data point
- It cannot be made self-contained because the summary lacks sufficient context — a vague claim has no value
- It is a vague procedural statement with no concrete information

## Output rules:
- Return ONLY the refined claims that pass quality review
- Each claim must be understandable on its own without the original summary
- Each claim must convey actionable information about the financial subject being analyzed
- Preserve the original category and source_reference — only modify claim_text
- Preserve numerical precision exactly as stated

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

## Original Summary

{summary_text}

## Report Chapter Context

{chapter_descriptions}

## Extracted Claims to Review

{claims_list}

## Instructions

Review each claim above against the original summary. For each claim:
1. If it is a valid factual claim about the financial subject — refine it to be self-contained and clear, then include it
2. If it is NOT a valid claim (methodology, meta-description, vague, promotional, mere mention, missing context that cannot be resolved) — drop it

Return only the claims that pass review, with refined claim_text where needed. Keep the original category and source_reference.

## Output Format

Return ONLY the fields defined in the schema below. Do not add extra fields.

The required JSON output schema is provided in the system prompt.
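
The output rules in this prompt pair (return only surviving claims, keep `category` and `source_reference`, change only `claim_text`) lend themselves to a mechanical check after parsing. The Python sketch below is one possible way to do that. The three field names come from the prompt; everything else is an assumption for illustration, including the function name `check_refined`, the idea of matching claims by their (`category`, `source_reference`) pair, and the sample claims, which are made up. The real schema is not shown on this page.

```python
from collections import Counter


def check_refined(original: list[dict], refined: list[dict]) -> list[str]:
    """Return rule violations; an empty list means all checks passed."""
    problems = []
    # How many original claims exist for each (category, source_reference) pair.
    allowed = Counter((c["category"], c["source_reference"]) for c in original)
    seen = Counter()
    for claim in refined:
        key = (claim.get("category"), claim.get("source_reference"))
        seen[key] += 1
        if seen[key] > allowed[key]:
            # Either the pair was altered or a claim was added.
            problems.append(f"no matching original claim for {key}")
        if not str(claim.get("claim_text", "")).strip():
            problems.append(f"empty claim_text for {key}")
    return problems


# Hypothetical input claims and two candidate model outputs.
original = [
    {"claim_text": "MU raised its guidance.", "category": "guidance", "source_reference": "s1"},
    {"claim_text": "No ESG data is provided.", "category": "esg", "source_reference": "s2"},
]
good = [
    {"claim_text": "Micron Technology (MU) raised its guidance.",
     "category": "guidance", "source_reference": "s1"},
]
bad = [
    {"claim_text": "Micron Technology (MU) raised its guidance.",
     "category": "outlook", "source_reference": "s1"},
]
good_problems = check_refined(original, good)
bad_problems = check_refined(original, bad)
```

Here `good_problems` is empty, because the second claim was legitimately dropped and the first kept its metadata, while `bad_problems` has one entry because the category was changed.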

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (29 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
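
The page does not say when the repair family is invoked; the low call count relative to the refinement pair suggests it may be a fallback. A minimal Python sketch of that pattern, under that assumption: try to parse the model output, and only on failure fill the repair user prompt and call a model. The `repair` callable is a hypothetical stand-in for a real model call, and `parse_or_repair` is a name invented for this example.

```python
import json
from typing import Callable

# User prompt template as shown above.
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def parse_or_repair(raw_text: str, schema_json: str,
                    repair: Callable[[str], str]) -> dict:
    """Parse raw_text as JSON, falling back to one repair call on failure."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        prompt = JSON_REPAIR_USER.format(schema_json=schema_json, raw_text=raw_text)
        # A second failure propagates to the caller rather than looping.
        parsed = json.loads(repair(prompt))
    if not isinstance(parsed, dict):
        raise ValueError("expected a single JSON object")
    return parsed


# Stand-in for a model call: records the prompt and returns a fixed object.
calls: list[str] = []


def fake_repair(prompt: str) -> str:
    calls.append(prompt)
    return '{"claims": []}'


ok = parse_or_repair('{"claims": [1]}', '{"type": "object"}', fake_repair)
fixed = parse_or_repair('{"claims": [', '{"type": "object"}', fake_repair)
```

Only the second call reaches `fake_repair`, so valid output never incurs the cost of a repair round-trip.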