Best LLMs for Activity Feed Blurb Generation

Category: Social & Promotional Content · Rail: absolute · Typical I/O: 1670→1099 tokens

Models

Frontier on this task: Tencent Hy3 at 8.81 / 10. Quality bar (90% of the frontier score): 7.93.
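As a quick check on the two numbers above, the sketch below reads the quality bar as 90% of the frontier score, which matches the published 7.93. The function name and the rounding to two decimals are assumptions made for illustration, not the site's documented method.

```python
def quality_bar(frontier_score: float, fraction: float = 0.90) -> float:
    """Return the minimum score that counts as 'good enough'.

    Assumption: the bar is a fixed fraction of the frontier score,
    rounded to two decimals for display.
    """
    return round(frontier_score * fraction, 2)


bar = quality_bar(8.81)
print(bar)  # 7.93
```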

Legend: point-estimate floor (CI low); upper CI (less certain). Rows are sorted by blended cost, with the best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	8.38 / 10	8.07	$0.28	best value
NVIDIA Nemotron-3 Nano 30B-A3B	8.00 / 10	7.71	$0.30	1.1x more expensive
GPT-5.4 Mini	8.64 / 10	8.47	$0.52	1.9x more expensive
GPT-5.6 Luna	8.36 / 10	8.21	$0.78	2.8x more expensive
NVIDIA Nemotron-3 Super 120B	8.30 / 10	8.06	$0.85	3.1x more expensive
MiniMax M3	8.36 / 10	7.94	$0.90	3.2x more expensive
Tencent Hy3	8.81 / 10	8.72	$1.21	4.4x more expensive
DeepSeek V4 Pro	8.58 / 10	8.33	$1.24	4.5x more expensive
GPT-5.6 Terra	8.38 / 10	8.24	$1.86	6.7x more expensive
Qwen 3.5 Flash	8.34 / 10	8.11	$2.53	9.1x more expensive
GPT-5.6 Sol	8.20 / 10	8.02	$4.33	16x more expensive
Gemini 3.1 Pro Preview	8.18 / 10	8.02	$4.95	18x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.69 / 10	8.46	$4.96	18x more expensive
Claude Sonnet 4.6	8.19 / 10	7.80	$5.08	18x more expensive
Gemini 3.5 Flash	8.03 / 10	7.85	$5.88	21x more expensive
Qwen 3.7 Plus	8.33 / 10	8.18	$5.93	21x more expensive
GPT-5.5	8.65 / 10	8.47	$6.08	22x more expensive
Qwen 3.6 Flash	7.94 / 10	7.74	$6.74	24x more expensive
Claude Sonnet 5	8.74 / 10	8.64	$6.98	25x more expensive
Qwen 3.6 Plus	8.52 / 10	8.33	$7.85	28x more expensive
Claude Opus 4.8	8.76 / 10	8.57	$9.66	35x more expensive
Meta Muse Spark 1.1	8.47 / 10	8.29	$10.18	37x more expensive
Kimi K2.6	8.66 / 10	8.50	$12.33	44x more expensive
Grok 4.5	7.95 / 10	7.77	$14.37	52x more expensive
Gemini 3.1 Flash Lite	7.64 / 10	7.38	$0.35	1.2x more expensive
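The table above can be read with two simple rules: a row either clears the 7.93 bar on its CI low, clears it only on its point estimate, or falls below it, and the best-value model is the cheapest one that clears the bar. The sketch below applies those rules to four rows copied from the table. The names `Row`, `status`, and `best_value` are hypothetical, and using the point estimate as the eligibility test for best value is an assumption (for this table the CI-low test picks the same model).

```python
from dataclasses import dataclass
from typing import List

BAR = 7.93  # quality bar quoted on this page


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    ci_low: float
    cost_per_1k: float


# A few rows copied from the table above.
ROWS = [
    Row("DeepSeek V4 Flash", 8.38, 8.07, 0.28),
    Row("NVIDIA Nemotron-3 Nano 30B-A3B", 8.00, 7.71, 0.30),
    Row("Gemini 3.1 Flash Lite", 7.64, 7.38, 0.35),
    Row("Tencent Hy3", 8.81, 8.72, 1.21),
]


def status(row: Row, bar: float = BAR) -> str:
    """Classify a row against the quality bar."""
    if row.score < bar:
        return "below bar"
    if row.ci_low < bar:
        return "clears on point estimate only"
    return "clears on CI low"


def best_value(rows: List[Row], bar: float = BAR) -> Row:
    """Cheapest row whose point estimate clears the bar (assumed rule)."""
    eligible = [r for r in rows if r.score >= bar]
    return min(eligible, key=lambda r: r.cost_per_1k)


for r in ROWS:
    print(f"{r.model}: {status(r)}")
print("best value:", best_value(ROWS).model)
```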

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★	DeepSeek	8.38 / 10	[8.07, 8.69]	MEDIUM	$0.28	best value	batch
NVIDIA Nemotron-3 Nano 30B-A3B	OpenRouter	8.00 / 10	[7.71, 8.29]	HIGH	$0.30	1.1x	batch
GPT-5.4 Mini	OpenAI	8.64 / 10	[8.47, 8.81]	RANKED	$0.52	1.9x	batch
GPT-5.6 Luna	OpenAI	8.36 / 10	[8.21, 8.50]	RANKED	$0.78	2.8x	batch
NVIDIA Nemotron-3 Super 120B	OpenRouter	8.30 / 10	[8.06, 8.54]	HIGH	$0.85	3.1x	batch
MiniMax M3	MiniMax	8.36 / 10	[7.94, 8.78]	MEDIUM	$0.90	3.2x	batch
Tencent Hy3 (best)	OpenRouter	8.81 / 10	[8.72, 8.89]	RANKED	$1.21	4.4x	batch
DeepSeek V4 Pro	DeepSeek	8.58 / 10	[8.33, 8.83]	HIGH	$1.24	4.5x	batch
GPT-5.6 Terra	OpenAI	8.38 / 10	[8.24, 8.51]	RANKED	$1.86	6.7x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	8.34 / 10	[8.11, 8.58]	HIGH	$2.53	9.1x	batch
GPT-5.6 Sol	OpenAI	8.20 / 10	[8.02, 8.38]	RANKED	$4.33	16x	batch
Gemini 3.1 Pro Preview	Gemini	8.18 / 10	[8.02, 8.35]	RANKED	$4.95	18x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.69 / 10	[8.46, 8.92]	HIGH	$4.96	18x	batch
Claude Sonnet 4.6	Anthropic	8.19 / 10	[7.80, 8.59]	MEDIUM	$5.08	18x	batch
Gemini 3.5 Flash	Gemini	8.03 / 10	[7.85, 8.22]	RANKED	$5.88	21x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.33 / 10	[8.18, 8.47]	RANKED	$5.93	21x	batch
GPT-5.5	OpenAI	8.65 / 10	[8.47, 8.84]	RANKED	$6.08	22x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	7.94 / 10	[7.74, 8.13]	RANKED	$6.74	24x	batch
Claude Sonnet 5	Anthropic	8.74 / 10	[8.64, 8.84]	RANKED	$6.98	25x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.52 / 10	[8.33, 8.71]	RANKED	$7.85	28x	batch
Claude Opus 4.8	Anthropic	8.76 / 10	[8.57, 8.95]	RANKED	$9.66	35x	batch
Meta Muse Spark 1.1	Meta	8.47 / 10	[8.29, 8.65]	RANKED	$10.18	37x	batch
Kimi K2.6	Moonshot AI	8.66 / 10	[8.50, 8.82]	RANKED	$12.33	44x	batch
Grok 4.5	xAI	7.95 / 10	[7.77, 8.13]	RANKED	$14.37	52x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the cheapest good-enough option. "16x" means you pay 16 times that reference cost for no quality benefit above the bar.

Typical call shape for this task: 1670 input tokens → 1099 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
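The Overpay figure described above is a plain cost ratio against the ★ model. The sketch below recomputes it from the displayed costs; the function name and the label formatting (one decimal below 10x, whole numbers above) are assumptions inferred from how the table reads. The published ratios appear to be derived from unrounded costs, so recomputing from the rounded dollar figures can differ in the last digit (for example, $4.33 / $0.28 is about 15.5, while the table shows 16x).

```python
def overpay_label(cost_per_1k: float, reference_cost_per_1k: float) -> str:
    """Format the cost ratio against the best-value model.

    Assumed display rule: one decimal place below 10x, whole numbers
    at 10x and above, and 'best value' for the reference model itself.
    """
    ratio = cost_per_1k / reference_cost_per_1k
    if ratio == 1.0:
        return "best value"
    if ratio < 10:
        return f"{ratio:.1f}x"
    return f"{ratio:.0f}x"


REFERENCE = 0.28  # DeepSeek V4 Flash, displayed cost per 1k runs

print(overpay_label(0.28, REFERENCE))  # best value
print(overpay_label(0.52, REFERENCE))  # 1.9x
print(overpay_label(4.95, REFERENCE))  # 18x
```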

Evaluation rubric

The output under evaluation is a short promotional blurb (1-2 sentences) for
a client's activity feed, generated from a published report's title and
content.

Judge ONLY the editorial quality of the blurb:
- Accuracy: it reflects the report's actual thesis; no invented facts,
  figures, or claims that the report does not support.
- Promotional fit: specific and engaging — it names or clearly evokes the
  client and gives a concrete reason to read the report; it reads as a hook,
  not a generic announcement.
- Style: clear, concise, professional finance/news voice; no clickbait
  phrasing, no filler.

OUT OF SCOPE — do not reward or penalize:
- Character/length-limit compliance. The length cap is enforced by a
  deterministic validator before any output is accepted, so every output you
  see already complies. Do NOT count characters or grade length.
- Output structure, schema, field naming, or formatting details.
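The rubric relies on a deterministic length validator running before the judge sees any output. The validator itself is not shown on this page; the sketch below is one minimal way it could look, assuming the cap is the "under 200 characters" limit from the system prompt, that the comparison is strict, and that surrounding whitespace is ignored. The names `MAX_CHARS` and `is_within_length_cap` are hypothetical.

```python
MAX_CHARS = 200  # assumption: taken from "under 200 characters total"


def is_within_length_cap(blurb: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the blurb is non-empty and strictly under the cap.

    Whitespace at either end is ignored before counting (an assumption).
    """
    text = blurb.strip()
    return 0 < len(text) < max_chars


print(is_within_length_cap("A short teaser about the report."))  # True
print(is_within_length_cap("x" * 200))                           # False
```

Keeping this check outside the judge means the rubric can score editorial quality alone, since every output that reaches it has already passed the cap.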

Prompt templates

This is a pooled capability: two prompt families share it. The pair shown first is the most frequently used in production.

ACTIVITY_PROMO_SYSTEM_PROMPT + ACTIVITY_PROMO_USER_PROMPT (4741 calls in window)

System prompt

You are an expert copywriter for a financial research platform. Your task is to write a short, engaging promotional blurb for an activity feed.

The blurb should:
- Be 1-2 sentences, under 200 characters total
- Spark curiosity and encourage clicking through
- Reference the client/brand name naturally
- Feel like an editorial teaser, not an advertisement
- Avoid clickbait, hyperbole, or exclamation marks
- Use present tense

Output your result in the specified JSON format.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Client: {client_name}
Post title: {post_title}

Full report text:
{report_text}

Write a short promotional blurb for this post to appear in the activity feed of other clients' pages.

The required JSON output schema is provided in the system prompt.
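The placeholders in the templates above (`{client_name}`, `{post_title}`, `{report_text}`, `{schema_json_string}`) can be filled with ordinary string formatting. The sketch below renders the user prompt and serializes a schema for the system prompt. The actual schema is not shown on this page, so `BLURB_SCHEMA` and its single `blurb` field are hypothetical, as are the function names and the sample client.

```python
import json

# Hypothetical schema: the real one is not shown in the source.
BLURB_SCHEMA = {
    "type": "object",
    "properties": {"blurb": {"type": "string"}},
    "required": ["blurb"],
    "additionalProperties": False,
}

# User prompt template, copied from the page.
USER_TEMPLATE = (
    "Client: {client_name}\n"
    "Post title: {post_title}\n"
    "\n"
    "Full report text:\n"
    "{report_text}\n"
    "\n"
    "Write a short promotional blurb for this post to appear in the "
    "activity feed of other clients' pages.\n"
    "\n"
    "The required JSON output schema is provided in the system prompt."
)


def render_user_prompt(client_name: str, post_title: str, report_text: str) -> str:
    """Fill the user template. Braces inside the values are left untouched."""
    return USER_TEMPLATE.format(
        client_name=client_name,
        post_title=post_title,
        report_text=report_text,
    )


def render_schema(schema: dict) -> str:
    """Serialize the schema for the {schema_json_string} placeholder."""
    return json.dumps(schema, indent=2)


prompt = render_user_prompt("Acme Capital", "Q3 Outlook", "Rates may fall.")
print(prompt.splitlines()[0])  # Client: Acme Capital
print(render_schema(BLURB_SCHEMA))
```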

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (9 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
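The call counts above (9 repair calls against 4741 generation calls) suggest the repair prompt runs only as a fallback when the first response does not parse. The page does not show that control flow, so the sketch below is an assumption: it tries `json.loads` first and calls a repair function only on failure. The `repair` callable stands in for a model call using the JSON_REPAIR prompts and is hypothetical; schema validation of the parsed object is omitted.

```python
import json
from typing import Callable, Optional


def parse_or_repair(
    raw_text: str,
    schema_json: str,
    repair: Callable[[str, str], str],
) -> Optional[dict]:
    """Parse model output as a JSON object, repairing once on failure.

    `repair(schema_json, raw_text)` is a hypothetical hook that should
    return the corrected JSON text. Returns None if repair also fails.
    """
    try:
        parsed = json.loads(raw_text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    repaired_text = repair(schema_json, raw_text)
    try:
        parsed = json.loads(repaired_text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def fake_repair(schema_json: str, raw_text: str) -> str:
    """Stand-in for the model call: closes one missing brace."""
    return raw_text + "}"


print(parse_or_repair('{"blurb": "ok"}', "{}", fake_repair))  # {'blurb': 'ok'}
print(parse_or_repair('{"blurb": "ok"', "{}", fake_repair))   # {'blurb': 'ok'}
```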