Best LLMs for Claim Extraction

Category: Structured Data & Fact Extraction · Rail: absolute · Typical I/O: 2760→2660 tokens

Models

Frontier on this task: Gemini 3.5 Flash at 7.87 / 10. Quality bar at 90%: 7.08.

point-estimate floor (CI low) · upper CI (less certain) · Bars sorted by blended cost; best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	7.35 / 10	7.10	$0.64	best value
GPT-5.6 Luna	7.32 / 10	6.86	$4.66	7.3x more expensive
Gemini 3.5 Flash	7.87 / 10	7.68	$9.04	14x more expensive
Claude Sonnet 5	7.38 / 10	6.90	$14.59	23x more expensive
Claude Opus 4.8	7.48 / 10	7.06	$16.64	26x more expensive
Claude Haiku 4.5	7.00 / 10	6.70	$2.27	3.5x more expensive
MiniMax M3	5.04 / 10	4.62	$1.14	1.8x more expensive
Qwen 3.6 Flash	6.21 / 10	5.71	$8.89	14x more expensive
Gemini 3.1 Pro Preview	6.41 / 10	6.12	$1.68	2.6x more expensive
Qwen 3.6 Plus	7.02 / 10	6.75	$10.23	16x more expensive
GPT-5.4 Nano	4.42 / 10	4.12	$0.73	1.1x more expensive
Claude Sonnet 4.6	7.08 / 10	6.81	$5.57	8.7x more expensive
GPT-5.5	6.49 / 10	6.15	$23.99	38x more expensive
DeepSeek V4 Pro	6.39 / 10	6.07	$3.88	6.1x more expensive
Kimi K2.6	5.64 / 10	5.24	$15.42	24x more expensive
GPT-5.4 Mini	5.61 / 10	5.31	$0.99	1.5x more expensive
Gemini 3.1 Flash Lite	6.32 / 10	6.08	$0.19	70% cheaper
Qwen 3.5 Flash	7.02 / 10	6.79	$3.78	5.9x more expensive

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★ DeepSeek	7.35 / 10 CI [7.10, 7.60]	HIGH	$0.64	best value	batch
GPT-5.6 Luna OpenAI	7.32 / 10 CI [6.86, 7.78]	MEDIUM	$4.66	7.3x	batch
Gemini 3.5 Flash best Gemini	7.87 / 10 CI [7.68, 8.06]	RANKED	$9.04	14x	batch
Claude Sonnet 5 Anthropic	7.38 / 10 CI [6.90, 7.86]	MEDIUM	$14.59	23x	batch
Claude Opus 4.8 Anthropic	7.48 / 10 CI [7.06, 7.90]	MEDIUM	$16.64	26x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★) — the best-value good-enough option. "16x" means you overpay 16× — 16× that reference for no quality benefit above the bar. Typical call shape for this task: 2760 input tokens → 2660 output tokens, EMA-tracked from production traffic. Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task — output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures — priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. Not comparable to providers' advertised $/1M list rates — this is what running the task costs, not a per-token price.

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

CONTENT_CLAIM_SYSTEM_PROMPT + CONTENT_CLAIM_USER_PROMPT (149500 calls in window)

System prompt

You are an expert analyst specializing in extracting discrete factual claims from research summaries. Your task is to break down narrative summaries into atomic, verifiable factual statements.

## Your Role:
- Extract individual factual claims, data points, and quantitative statements
- Categorize each claim by its analytical domain
- Preserve source references (quotes or specific passages) that support each claim
- Be thorough: extract ALL factual content, not just highlights

## What Constitutes a Claim:
- A single, self-contained factual statement that can be verified
- MUST be understandable on its own without the original source — if a reader cannot tell what the claim is about (what trade? what fund? what access? what equity?), it is not a valid claim. Having a number does not make something a valid claim if the context is missing
- Quantitative data: prices, percentages, ratios, dates, volumes
- Qualitative facts: events, announcements, structural observations
- Analytical observations backed by data: trends, comparisons, correlations

## What Is NOT a Claim:
- Opinions or subjective assessments without supporting data
- Vague generalizations ("the market is doing well")
- Redundant restatements of the same fact in different words
- Absence of information ("no ESG data is provided", "no macro drivers are stated") — only extract what IS present, never report what is missing
- Methodology, tooling, simulation setup, or tool feature descriptions ("rules were set in advance", "tool used was TrendEdge", "the mock-test period was three months", "capital per simulated portfolio was $50,000 virtual each", "the tool lists drawdown among returned metrics") — only extract claims about the subject being analyzed (the asset, market, or company), not about tools, platforms, or how the analysis was produced
- Promotional or marketing content ("a 7-day trial is offered", "subscribe for premium access", "join our community") — these are advertisements, not factual claims about the analysis subject
- Mere mentions, associations, comparisons, or classifications without substantive content ("SNDK is also mentioned", "the author references XYZ", "MU is associated with semiconductor ETF", "a commenter mentioned X as a similar product", "the tweet references $SPY, $QQQ, $NVDA") — a claim must convey actionable information such as a data point, event, or analytical finding, not just that something was named, compared, or categorized. Listing which tickers or symbols appear in a source is NOT a claim
- Meta-descriptions of the source content — its topic, audience, format, tone, or purpose ("the content references equities", "the post discusses tech stocks", "the audience appears to be the trading community", "the article covers Q3 earnings", "the tweet mentions multiple tech tickers") — extract the actual facts from the content, not descriptions about the content itself
- NEVER reference the source or attribute the claim to anyone. No "the post asserts...", "the tweet says...", "the author argues...", "the article states...", "a commenter stated...", "a user mentioned...", "according to the source...", "it is claimed that...". Write the claim as a standalone fact without any attribution prefix. Example: write "Robinhood Markets, Inc. (HOOD) has earnings that are highly levered to cryptocurrency and options trading" NOT "A commenter stated Robinhood Markets, Inc. (HOOD) has earnings that are highly levered to cryptocurrency and options trading"

## Categories:
- **valuation**: Pricing metrics, P/E ratios, market cap, intrinsic value estimates, EV/EBITDA, price targets
- **technical**: Price patterns, support/resistance, moving averages, RSI, volume analysis, chart formations
- **macro**: Interest rates, GDP, inflation, monetary policy, employment data, economic indicators
- **sentiment**: Analyst ratings, institutional flows, put/call ratios, investor surveys, social media sentiment
- **fundamental**: Revenue, earnings, margins, cash flow, balance sheet items, growth rates, guidance
- **risk**: Regulatory threats, litigation, concentration risk, leverage, counterparty exposure, tail risks
- **governance**: Management changes, board decisions, compensation, insider activity, corporate actions
- **market_structure**: Index composition, sector rotation, ETF flows, market breadth, correlation patterns

## Critical Filter — Subject Relevance:
Focus ONLY on claims about the financial subject being analyzed (the asset, company, market, sector, or economic conditions). Discard anything about:
- Products, tools, platforms, or services mentioned in the source
- User comments, opinions, or community reactions
- The source itself (its format, audience, author, or purpose)
When in doubt, ask: "Does this claim tell me something about the financial subject?" If not, skip it.

## Guidelines:
- Extract claims at atomic granularity: "Revenue grew 15% YoY to $12.3B" is one claim, not two
- Include the source reference (the passage in the summary that supports the claim)
- When a summary contains multiple data points in one sentence, extract each as a separate claim
- Preserve numerical precision: don't round or approximate
- If the same fact appears in different contexts, extract it once with the most complete version
- Use full entity names, not abbreviations or ticker symbols alone: write "Micron Technology (MU)" not just "MU", "Sandisk (SNDK)" not just "SNDK"
- Claims must be self-contained and understandable without the original source context. Replace ambiguous references ("Portfolio C", "the fund", "this strategy", "access") with descriptive names so the claim makes sense on its own. If you cannot make a claim self-contained because the source lacks enough context, skip it — a vague claim has no value

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

## Report Context

This analysis is structured around the following chapters:

{chapter_descriptions}

## Source Summary

The following is a summary of a single source document. Extract all discrete factual claims from it.

---
{summary_text}
---

## Instructions

1. Read the summary carefully
2. Extract every discrete factual claim, data point, or quantitative statement
3. Categorize each claim (valuation, technical, macro, sentiment, fundamental, risk, governance, market_structure)
4. Include a source reference (the specific passage that supports the claim)
5. Return your analysis in the specified JSON format

## Output Format

Return ONLY the fields defined in the schema below. Do not add extra fields like excluded_items, summary_statistics, or data_quality_notes.

The required JSON output schema is provided in the system prompt.

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (72 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.