Best LLMs for Geographic Region Identification

Category: Structured Data & Fact Extraction · Rail: absolute · Typical I/O: 865→1391 tokens

Models

Frontier on this task: Kimi K2.6 at 9.39 / 10. Quality bar at 90%: 8.45.
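The two figures above are consistent with the bar being 90% of the frontier score. A minimal sketch under that assumption (the function name `quality_bar` is illustrative, not part of the page's tooling):

```python
def quality_bar(frontier_score: float, fraction: float = 0.90) -> float:
    """Quality bar as a fraction of the frontier score, rounded to 2 decimals.

    Assumption: the bar is simply fraction * frontier score.
    """
    return round(frontier_score * fraction, 2)


# Frontier score for this task, taken from the line above.
bar = quality_bar(9.39)
print(bar)
```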

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are models with MEDIUM or higher confidence whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
GPT-5.6 Luna	8.74 / 10	8.50	$1.24	best value
DeepSeek V4 Pro	9.04 / 10	8.71	$2.04	1.6x more expensive
GPT-5.6 Terra	8.72 / 10	8.51	$2.59	2.1x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.60 / 10	8.36	$5.53	4.4x more expensive
Kimi K2.6	9.39 / 10	9.23	$8.84	7.1x more expensive
GPT-5.5	8.77 / 10	8.58	$14.08	11x more expensive
Claude Sonnet 4.6	9.10 / 10	8.81	$19.31	16x more expensive
Gemini 3.1 Flash Lite	7.91 / 10	7.73	$0.48	62% cheaper
Meta Muse Spark 1.1	8.31 / 10	7.90	$11.74	9.4x more expensive
Grok 4.5	8.13 / 10	8.02	$10.31	8.3x more expensive
Gemini 3.5 Flash	7.32 / 10	7.03	$3.34	2.7x more expensive
Claude Sonnet 5	8.44 / 10	8.26	$4.46	3.6x more expensive

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
GPT-5.6 Luna ★	OpenAI	8.74 / 10	[8.50, 8.99]	HIGH	$1.24	best value	batch
DeepSeek V4 Pro	DeepSeek	9.04 / 10	[8.71, 9.36]	MEDIUM	$2.04	1.6x	batch
GPT-5.6 Terra	OpenAI	8.72 / 10	[8.51, 8.93]	HIGH	$2.59	2.1x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.60 / 10	[8.36, 8.84]	HIGH	$5.53	4.4x	batch
Kimi K2.6 (best)	Moonshot AI	9.39 / 10	[9.23, 9.54]	RANKED	$8.84	7.1x	batch
GPT-5.5	OpenAI	8.77 / 10	[8.58, 8.95]	RANKED	$14.08	11x	batch
Claude Sonnet 4.6	Anthropic	9.10 / 10	[8.81, 9.40]	HIGH	$19.31	16x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the best-value good-enough option. "16x" means you pay 16× that reference cost for no quality benefit above the bar.

Typical call shape for this task: 865 input tokens → 1391 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
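The overpay column can be reproduced approximately from the published figures. The sketch below assumes the reference model is the cheapest one whose CI low clears the 8.45 bar, and that overpay is a plain cost ratio. The `Row` type and function names are illustrative. Published multiples may differ slightly because they are presumably computed from unrounded costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    ci_low: float        # lower bound of the quality confidence interval
    cost_per_1k: float   # observed $ per 1,000 task runs


# A few rows copied from the tables above.
ROWS = [
    Row("Gemini 3.1 Flash Lite", 7.73, 0.48),
    Row("GPT-5.6 Luna", 8.50, 1.24),
    Row("DeepSeek V4 Pro", 8.71, 2.04),
    Row("Kimi K2.6", 9.23, 8.84),
    Row("Claude Sonnet 4.6", 8.81, 19.31),
]


def best_value(rows: list[Row], bar: float) -> Row:
    """Cheapest row whose CI low clears the bar (assumed selection rule)."""
    eligible = [r for r in rows if r.ci_low >= bar]
    return min(eligible, key=lambda r: r.cost_per_1k)


def overpay(row: Row, reference: Row) -> float:
    """Cost multiple relative to the best-value reference model."""
    return row.cost_per_1k / reference.cost_per_1k


ref = best_value(ROWS, bar=8.45)
for r in ROWS:
    print(f"{r.model}: {overpay(r, ref):.1f}x vs {ref.model}")
```

Under this rule Gemini 3.1 Flash Lite is cheaper but is not eligible as the reference, because its CI low falls below the bar.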

Prompt templates

The system + user template pair used for this task.

RESEARCH_REGION_IDENTIFIER_SYSTEM + RESEARCH_REGION_IDENTIFIER_USER (1074 calls in window)

System prompt

You are an expert market analyst specializing in geographic market identification.

Your task is to identify the most relevant geographic regions for researching a specific subject.

Consider:
1. Where is the subject primarily headquartered or based?
2. What are the subject's primary markets?
3. Where does the subject have significant operations?
4. Where would authoritative information be published?
5. What regulatory jurisdictions are most relevant?

Output a JSON object with:
{{
  "region_codes": ["US", "UK", ...],
  "reasoning": "Brief explanation of why these regions were selected"
}}

Select 1-4 most relevant regions. Prioritize quality over quantity.
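Because the system prompt fixes the output shape (a JSON object with `region_codes` and `reasoning`, 1-4 regions), a caller can validate responses before using them. The following is a minimal sketch, not the pipeline's actual validator. The function name and the check against the available region set are assumptions.

```python
import json


def parse_region_response(raw: str, available: set[str]) -> dict:
    """Parse and validate a model response against the prompt's contract.

    Raises ValueError on any violation (json.JSONDecodeError is a ValueError).
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    codes = data.get("region_codes")
    if not isinstance(codes, list) or not all(isinstance(c, str) for c in codes):
        raise ValueError("region_codes must be a list of strings")
    if not 1 <= len(codes) <= 4:
        raise ValueError("expected 1-4 region codes")

    # Assumption: codes should come from the list supplied in the user prompt.
    unknown = [c for c in codes if c not in available]
    if unknown:
        raise ValueError(f"unknown region codes: {unknown}")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")

    return {"region_codes": codes, "reasoning": reasoning}


example = '{"region_codes": ["US", "UK"], "reasoning": "HQ and primary market."}'
print(parse_region_response(example, available={"US", "UK", "DE"}))
```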

User prompt

Identify the most relevant geographic regions for researching: {subject_name}

Subject Code: {subject_code}
Subject Type: {subject_type}
Subject Description: {subject_description}

Available Regions:
{available_regions}

Additional Context:
{additional_context}
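The doubled braces in the system prompt and the `{name}` placeholders in the user prompt suggest the templates are rendered with Python's `str.format`. That is an assumption, as are the helper name and every example value below. A minimal sketch using excerpts of the two templates:

```python
# Excerpt of the system template; doubled braces kept exactly as shown above.
SYSTEM_TEMPLATE = (
    "Output a JSON object with:\n"
    "{{\n"
    '  "region_codes": ["US", "UK", ...],\n'
    '  "reasoning": "Brief explanation of why these regions were selected"\n'
    "}}\n"
)

USER_TEMPLATE = (
    "Identify the most relevant geographic regions for researching: {subject_name}\n"
    "\n"
    "Subject Code: {subject_code}\n"
    "Subject Type: {subject_type}\n"
    "Subject Description: {subject_description}\n"
    "\n"
    "Available Regions:\n"
    "{available_regions}\n"
    "\n"
    "Additional Context:\n"
    "{additional_context}\n"
)


def render_prompts(**fields: str) -> tuple[str, str]:
    """Render both templates; str.format turns '{{' and '}}' into literal braces."""
    system = SYSTEM_TEMPLATE.format()      # no placeholders, only escaped braces
    user = USER_TEMPLATE.format(**fields)  # raises KeyError if a field is missing
    return system, user


# Hypothetical values, for illustration only.
system, user = render_prompts(
    subject_name="Acme Rail",
    subject_code="ACME-001",
    subject_type="company",
    subject_description="Regional rail operator.",
    available_regions="US, UK, DE",
    additional_context="None",
)
print(system)
print(user)
```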