Best LLMs for Topic Grouping and Client Matching

Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 18319→5860 tokens

Models

Frontier on this task: GPT-5.6 Sol at 8.63 / 10. Quality bar at 90%: 7.76.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	8.21 / 10	8.06	$4.27	best value
DeepSeek V4 Pro	8.14 / 10	7.96	$10.45	2.4x more expensive
GPT-5.6 Luna	8.00 / 10	7.57	$11.46	2.7x more expensive
Qwen 3.7 Plus	8.21 / 10	7.97	$18.34	4.3x more expensive
Qwen 3.6 Plus	7.81 / 10	7.55	$19.54	4.6x more expensive
Gemini 3.1 Pro Preview	7.98 / 10	7.79	$26.37	6.2x more expensive
Kimi K2.6	7.97 / 10	7.68	$42.78	10x more expensive
GPT-5.6 Terra	8.46 / 10	8.22	$44.89	11x more expensive
Gemini 3.5 Flash	8.51 / 10	8.34	$53.13	12x more expensive
Grok 4.5	8.44 / 10	8.28	$58.08	14x more expensive
Meta Muse Spark 1.1	8.28 / 10	8.00	$60.04	14x more expensive
GPT-5.6 Sol	8.63 / 10	8.37	$67.22	16x more expensive
GPT-5.5	8.11 / 10	7.79	$87.79	21x more expensive
MiniMax M3	6.59 / 10	6.40	$8.21	1.9x more expensive
GPT-5.4 Mini	5.35 / 10	4.87	$9.79	2.3x more expensive
Gemini 3.1 Flash Lite	6.78 / 10	6.43	$4.36	1.02x more expensive
Claude Haiku 4.5	5.35 / 10	4.88	$23.43	5.5x more expensive
Claude Sonnet 4.6	7.62 / 10	7.27	$60.82	14x more expensive
Tencent Hy3	6.71 / 10	6.25	$8.35	2x more expensive
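
The bar logic described in the legend can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function name `bar_status` and its three labels are hypothetical, and the sketch ignores the confidence tier (MEDIUM+) that the legend also mentions. The published bar of 7.76 is used directly, since the page does not document how it rounds 90% of the frontier score.

```python
QUALITY_BAR = 7.76  # published on the page as the 90% quality bar


def bar_status(score: float, ci_low: float, bar: float = QUALITY_BAR) -> str:
    """Classify a model against the quality bar (labels are hypothetical)."""
    if ci_low >= bar:
        return "clears"   # even the CI floor is above the bar
    if score >= bar:
        return "greyed"   # point estimate clears, CI floor does not
    return "below"        # point estimate itself is under the bar


# (score, CI low) pairs copied from the table above
print(bar_status(8.21, 8.06))  # DeepSeek V4 Flash
print(bar_status(8.00, 7.57))  # GPT-5.6 Luna
print(bar_status(7.62, 7.27))  # Claude Sonnet 4.6
```

Under these assumptions, DeepSeek V4 Flash clears the bar outright, GPT-5.6 Luna falls into the greyed group, and Claude Sonnet 4.6 is below the bar.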

Cost breakdown

Model	Quality	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★ DeepSeek	8.21 / 10 CI [8.06, 8.36]	RANKED	$4.27	best value	batch
DeepSeek V4 Pro DeepSeek	8.14 / 10 CI [7.96, 8.32]	RANKED	$10.45	2.4x	batch
GPT-5.6 Luna OpenAI	8.00 / 10 CI [7.57, 8.44]	MEDIUM	$11.46	2.7x	batch
Qwen 3.7 Plus Alibaba Cloud (DashScope)	8.21 / 10 CI [7.97, 8.45]	HIGH	$18.34	4.3x	batch
Qwen 3.6 Plus Alibaba Cloud (DashScope)	7.81 / 10 CI [7.55, 8.08]	HIGH	$19.54	4.6x	batch
Gemini 3.1 Pro Preview Gemini	7.98 / 10 CI [7.79, 8.18]	RANKED	$26.37	6.2x	batch
Kimi K2.6 Moonshot AI	7.97 / 10 CI [7.68, 8.26]	HIGH	$42.78	10x	batch
GPT-5.6 Terra OpenAI	8.46 / 10 CI [8.22, 8.69]	HIGH	$44.89	11x	batch
Gemini 3.5 Flash Gemini	8.51 / 10 CI [8.34, 8.68]	RANKED	$53.13	12x	batch
Grok 4.5 xAI	8.44 / 10 CI [8.28, 8.60]	RANKED	$58.08	14x	batch
Meta Muse Spark 1.1 Meta	8.28 / 10 CI [8.00, 8.57]	HIGH	$60.04	14x	batch
GPT-5.6 Sol best OpenAI	8.63 / 10 CI [8.37, 8.89]	HIGH	$67.22	16x	batch
GPT-5.5 OpenAI	8.11 / 10 CI [7.79, 8.43]	MEDIUM	$87.79	21x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★). "16x" means you pay 16× that reference for no quality benefit above the bar. Typical call shape for this task: 18319 input tokens → 5860 output tokens, EMA-tracked from production traffic. Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
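
As a rough check on the Overpay column, the ratio is each model's cost divided by the best-value model's cost. The sketch below assumes two-significant-figure formatting, which happens to reproduce the labels in the tables; the page's real formatting rule is not stated, and `overpay_label` is a hypothetical name.

```python
BEST_VALUE_COST = 4.27  # DeepSeek V4 Flash, $ per 1k runs


def overpay_label(cost: float, best_value_cost: float = BEST_VALUE_COST) -> str:
    """Format cost as a multiple of the best-value cost (assumed 2 sig. figs)."""
    ratio = cost / best_value_cost
    if ratio == 1:
        return "best value"
    return f"{ratio:.2g}x"


# Costs copied from the table above
for name, cost in [
    ("DeepSeek V4 Flash", 4.27),
    ("DeepSeek V4 Pro", 10.45),
    ("GPT-5.6 Sol", 67.22),
]:
    print(name, overpay_label(cost))
```

For example, GPT-5.6 Sol at $67.22 is about 15.7 times the $4.27 reference, which rounds to the "16x" shown.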

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

TOPIC_CLIENT_MATCHING_SYSTEM_PROMPT + TOPIC_CLIENT_MATCHING_USER_PROMPT (1667 calls in window)

System prompt

You are a topic grouping and matching specialist. Your task is to assign workflow topics (individual article topics) to broader client topic categories — either existing ones or new ones you define.

**Goal:**
Each workflow topic represents a specific article (e.g., "Q3 Revenue Beat", "Revenue Acceleration Outlook"). Client topics are broader persistent categories (e.g., "Revenue Trends") that group related articles together. Multiple workflow topics can — and should — share the same client topic when they cover similar themes.

**Matching to Existing Client Topics:**
- If an existing client topic covers the same broad theme as a workflow topic, assign it there
- Semantic similarity is what matters, not identical names
- "Revenue Growth Outlook" should match existing "Revenue Trends"
- "FDA Phase 3 Results" should match existing "Regulatory & Clinical"
- Be conservative: only match when the workflow topic genuinely belongs to that category

**Creating New Groups:**
- When no existing client topic fits, create a new group with a broader category name
- New group names should be BROADER than any individual workflow topic they contain
- Example: "Q3 Revenue Beat" + "Revenue Acceleration Outlook" → new group "Revenue Trends"
- Only merge workflow topics that genuinely overlap thematically — distinct topics should stay separate
- A new group can contain just one workflow topic if it doesn't relate to anything else
- Group descriptions MUST be under 500 characters (this is a hard technical limit)

**Output Rules:**
- Return one assignment per workflow topic (workflow_topic_id)
- Each assignment must have EXACTLY ONE of: existing_client_topic_id OR new_group_id (not both, not neither)
- Multiple workflow topics CAN share the same existing_client_topic_id or new_group_id
- Every new_group_id used in assignments must have a corresponding entry in new_groups
- new_groups list can be empty if all workflow topics match existing client topics

Output your response in the specified JSON format.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

**Workflow Topics to Assign:**
{workflow_topics_json}

**Existing Client Topics:**
{existing_client_topics_json}

For each workflow topic, assign it to an existing client topic (by ID) or to a new broader group. Group similar workflow topics together under one category.

**JSON Output:** The required JSON output schema is provided in the system prompt.
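
The output rules in the system prompt lend themselves to a mechanical check on the model's response. The sketch below is one way to do that, under stated assumptions: the actual schema is not shown on this page, so the top-level keys `assignments` and `new_groups` and the group field `description` are guesses based on the prompt wording, and `validate_assignments` is a hypothetical helper.

```python
from collections import Counter


def validate_assignments(output, workflow_topic_ids, existing_client_topic_ids):
    """Return a list of rule violations; an empty list means the output passes."""
    errors = []
    assignments = output.get("assignments", [])  # assumed key name
    new_groups = output.get("new_groups", [])
    group_ids = {g.get("new_group_id") for g in new_groups}

    # Rule: one assignment per workflow topic.
    seen = Counter(a.get("workflow_topic_id") for a in assignments)
    if seen != Counter(workflow_topic_ids):
        errors.append("assignments do not cover each workflow topic exactly once")

    for a in assignments:
        topic = a.get("workflow_topic_id")
        existing = a.get("existing_client_topic_id")
        new = a.get("new_group_id")
        # Rule: exactly one of the two targets (not both, not neither).
        if (existing is None) == (new is None):
            errors.append(f"{topic}: needs exactly one target")
            continue
        if existing is not None and existing not in existing_client_topic_ids:
            errors.append(f"{topic}: unknown existing_client_topic_id {existing}")
        # Rule: every new_group_id used must appear in new_groups.
        if new is not None and new not in group_ids:
            errors.append(f"{topic}: new_group_id {new} missing from new_groups")

    # Rule: group descriptions must be under 500 characters.
    for g in new_groups:
        if len(g.get("description", "")) >= 500:  # assumed field name
            errors.append(f"{g.get('new_group_id')}: description too long")
    return errors


sample = {
    "assignments": [
        {"workflow_topic_id": "w1", "existing_client_topic_id": "c1"},
        {"workflow_topic_id": "w2", "new_group_id": "g1"},
    ],
    "new_groups": [{"new_group_id": "g1", "description": "Revenue Trends"}],
}
print(validate_assignments(sample, ["w1", "w2"], {"c1"}))
```

Returning a list of messages rather than raising on the first failure makes it easier to log every violation for a given response.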

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (1 call in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
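
The call counts suggest the repair pair runs only when the primary output fails to parse, although the page does not say so. A minimal sketch of that fallback, assuming this is how it is used: `parse_or_repair` and the `call_repair_model` callable are hypothetical stand-ins for whatever client sends the repair prompt, and the template string is rebuilt from the user prompt shown above.

```python
import json

# Rebuilt from the JSON_REPAIR_USER prompt shown above.
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def parse_or_repair(raw_text, schema_json, call_repair_model):
    """Parse model output; on failure, make one repair attempt.

    call_repair_model is a hypothetical callable: prompt string in,
    model text out.
    """
    try:
        parsed = json.loads(raw_text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass  # fall through to the repair attempt

    prompt = JSON_REPAIR_USER.format(schema_json=schema_json, raw_text=raw_text)
    repaired = json.loads(call_repair_model(prompt))
    if not isinstance(repaired, dict):
        raise ValueError("repair output is not a JSON object")
    return repaired


def fake_repair_model(prompt):
    """Stand-in for a real model call, used only for illustration."""
    return '{"assignments": [], "new_groups": []}'


print(parse_or_repair('{"assignments": [', "{}", fake_repair_model))
```

The sketch only checks that the result is a JSON object; checking it against the schema would be a separate step.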