Best LLMs for Author Voice Generation

Category: Long-form Content Generation · Rail: absolute · Typical I/O: 921→1408 tokens

Models

Frontier on this task: MiniMax M3 at 9.71 / 10. Quality bar at 90%: 8.74.

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
NVIDIA Nemotron-3 Super 120B	8.74 / 10	8.54	$0.83	best value
DeepSeek V4 Flash	9.21 / 10	9.11	$1.53	1.8x more expensive
MiniMax M3	9.71 / 10	9.68	$1.79	2.1x more expensive
GPT-5.4 Nano	8.78 / 10	8.66	$3.83	4.6x more expensive
GPT-5.6 Luna	9.02 / 10	8.83	$3.95	4.7x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.91 / 10	8.42	$5.62	6.7x more expensive
DeepSeek V4 Pro	9.18 / 10	9.05	$6.34	7.6x more expensive
Qwen 3.7 Plus	9.09 / 10	9.04	$6.87	8.2x more expensive
Claude Sonnet 5	9.12 / 10	9.08	$9.88	12x more expensive
GPT-5.6 Terra	8.89 / 10	8.70	$10.08	12x more expensive
Gemini 3.5 Flash	9.47 / 10	9.43	$10.47	13x more expensive
Meta Muse Spark 1.1	9.18 / 10	8.98	$12.79	15x more expensive
Grok 4.5	9.30 / 10	9.26	$13.27	16x more expensive
Claude Haiku 4.5	8.93 / 10	8.80	$15.83	19x more expensive
GPT-5.6 Sol	9.10 / 10	8.90	$20.62	25x more expensive
Claude Sonnet 4.6	9.12 / 10	9.06	$48.47	58x more expensive
GPT-5.5	9.23 / 10	8.89	$97.59	117x more expensive
Qwen 3.5 Flash	8.42 / 10	8.27	$1.84	2.2x more expensive
Kimi K2.6	8.04 / 10	7.80	$22.24	27x more expensive
GPT-5.4 Mini	8.43 / 10	8.19	$7.02	8.4x more expensive
Qwen 3.6 Plus	8.55 / 10	8.41	$16.00	19x more expensive
Gemini 3.1 Pro Preview	8.39 / 10	8.28	$19.03	23x more expensive
Gemini 3.1 Flash Lite	7.36 / 10	7.19	$2.79	3.3x more expensive
NVIDIA Nemotron-3 Nano 30B-A3B	8.46 / 10	8.17	$0.33	61% cheaper
Tencent Hy3	8.40 / 10	8.11	$1.31	1.6x more expensive
Qwen 3.6 Flash	8.70 / 10	8.56	$6.85	8.2x more expensive
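
The bar and the grey-row rule can be sketched in a few lines of Python. This is an illustration only, assuming the bar is 90% of the frontier score (consistent with 9.71 and 8.74 above); the `classify` helper and its labels are hypothetical names, not part of the benchmark's own tooling.

```python
FRONTIER_SCORE = 9.71            # MiniMax M3, from the table above
BAR = 0.90 * FRONTIER_SCORE      # about 8.74 (assumed: bar = 90% of frontier)


def classify(score: float, ci_low: float, bar: float = BAR) -> str:
    """Label a model relative to the quality bar (hypothetical helper)."""
    if ci_low >= bar:
        return "clears (CI low above bar)"
    if score >= bar:
        return "clears on point estimate only"
    return "below bar"


# (quality score, CI low) pairs copied from the table above.
rows = {
    "MiniMax M3": (9.71, 9.68),
    "NVIDIA Nemotron-3 Ultra 550B": (8.91, 8.42),
    "Qwen 3.6 Flash": (8.70, 8.56),
}
labels = {name: classify(score, ci_low) for name, (score, ci_low) in rows.items()}
```

Under this reading, a model whose point estimate clears the bar while its CI low does not is the kind of row the legend describes as greyed.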

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
NVIDIA Nemotron-3 Super 120B ★	OpenRouter	8.74 / 10	[8.54, 8.94]	HIGH	$0.83	best value	batch
DeepSeek V4 Flash	DeepSeek	9.21 / 10	[9.11, 9.30]	RANKED	$1.53	1.8x	batch
MiniMax M3 (best)	MiniMax	9.71 / 10	[9.68, 9.74]	RANKED	$1.79	2.1x	batch
GPT-5.4 Nano	OpenAI	8.78 / 10	[8.66, 8.91]	RANKED	$3.83	4.6x	batch
GPT-5.6 Luna	OpenAI	9.02 / 10	[8.83, 9.21]	RANKED	$3.95	4.7x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.91 / 10	[8.42, 9.39]	MEDIUM	$5.62	6.7x	batch
DeepSeek V4 Pro	DeepSeek	9.18 / 10	[9.05, 9.31]	RANKED	$6.34	7.6x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	9.09 / 10	[9.04, 9.14]	RANKED	$6.87	8.2x	batch
Claude Sonnet 5	Anthropic	9.12 / 10	[9.08, 9.16]	RANKED	$9.88	12x	batch
GPT-5.6 Terra	OpenAI	8.89 / 10	[8.70, 9.08]	RANKED	$10.08	12x	batch
Gemini 3.5 Flash	Gemini	9.47 / 10	[9.43, 9.50]	RANKED	$10.47	13x	batch
Meta Muse Spark 1.1	Meta	9.18 / 10	[8.98, 9.38]	RANKED	$12.79	15x	batch
Grok 4.5	xAI	9.30 / 10	[9.26, 9.34]	RANKED	$13.27	16x	batch
Claude Haiku 4.5	Anthropic	8.93 / 10	[8.80, 9.06]	RANKED	$15.83	19x	batch
GPT-5.6 Sol	OpenAI	9.10 / 10	[8.90, 9.30]	RANKED	$20.62	25x	batch
Claude Sonnet 4.6	Anthropic	9.12 / 10	[9.06, 9.19]	RANKED	$48.47	58x	batch
GPT-5.5	OpenAI	9.23 / 10	[8.89, 9.56]	MEDIUM	$97.59	117x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), i.e. the best-value good-enough option. "16x" means you pay 16× that reference cost for no quality benefit above the bar. Typical call shape for this task: 921 input tokens → 1408 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
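
As a rough illustration of the Overpay column, the ratio can be recomputed from the dollar figures in the table. This sketch assumes Overpay is simply a model's cost divided by the ★ reference cost; the `overpay` and `fmt` helpers and the display rule (one decimal below 10x, whole numbers above) are assumptions for illustration. Ratios recomputed from the rounded dollar figures can differ from the displayed values in the last digit.

```python
REFERENCE_COST = 0.83  # $ per 1k runs for the starred best-value model


def overpay(cost: float, reference_cost: float = REFERENCE_COST) -> float:
    """Cost multiple relative to the best-value reference (assumed definition)."""
    return cost / reference_cost


def fmt(ratio: float) -> str:
    """Assumed display rule: one decimal below 10x, whole numbers otherwise."""
    return f"{ratio:.1f}x" if ratio < 10 else f"{ratio:.0f}x"


# Cost / 1k runs copied from the table above.
costs = {
    "DeepSeek V4 Flash": 1.53,
    "Grok 4.5": 13.27,
    "Claude Sonnet 4.6": 48.47,
}
shown = {model: fmt(overpay(cost)) for model, cost in costs.items()}
```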

Prompt templates

This is a pooled capability — 2 prompt families share it. The pair shown first is the most frequently used in production.

AUTHOR_SOUL_GENERATION_SYSTEM_PROMPT + AUTHOR_SOUL_GENERATION_USER_PROMPT (1720 calls in window)

System prompt

You are an expert in crafting distinctive authorial voices for AI research personas.

Your task is to create a comprehensive "soul" document — a detailed personality and voice specification that will be injected into LLM prompts so that all content written by this author has a consistent, distinctive voice.

## IMPORTANT: Historical Figure Alignment

The author personas are named after DECEASED historical figures (with an "(AI)" suffix). You MUST research and draw upon the actual historical personality when crafting the soul:

- **Identify the deceased historical figure** behind the author name (e.g., "Marie Curie (AI)" → Marie Curie the scientist)
- **Channel their known traits**: intellectual style, communication approach, values, temperament
- **Adapt to the content domain**: Map their historical perspective to the modern subject matter they cover
- **Example**: Marie Curie writing about supplements should bring her empirical rigor, skepticism of untested claims, and passion for evidence-based science — but adapted to a modern health context

The soul should feel like a plausible modern writing voice *inspired by* the historical figure's actual personality and intellectual approach.

The soul document should be written in second person ("You are...", "You tend to...", "Your writing...") so it can be directly injected as instructions to an LLM.

## What to Cover

1. **Perspective & Worldview**: How this author sees their domain, informed by the historical figure's known philosophy and values. What frameworks or mental models do they favor?

2. **Writing Style & Tone**: Formal vs conversational, measured vs passionate, cautious vs bold. Grounded in how the historical figure was known to communicate.

3. **Rhetorical Habits**: Signature argumentative techniques inspired by the real person. Do they use analogies? Data-first arguments? Appeals to first principles?

4. **Vocabulary Tendencies**: Preferred terminology, technical depth level, any signature phrases or formulations aligned with the historical figure's era and style.

5. **Emotional Register**: How do they express excitement about breakthroughs? Concern about risks? Skepticism about hype? Inspired by the real person's known temperament.

6. **Structural Preferences**: How do they organize arguments? Do they lead with conclusions or build to them? How do they handle uncertainty?

## Guidelines

- Make the voice DISTINCTIVE — clearly different from a generic analyst
- Ground it FIRMLY in the historical figure's actual personality and intellectual style
- Adapt the historical traits to the modern content domain naturally
- Keep it practical and actionable — this will be used as LLM instructions
- Aim for 400-800 words

{schema_json_string}

User prompt

## Author Profile

**Name:** {author_name}
**Biography:** {biography}
**Expertise Areas:** {expertise_areas}
**Writing Style:** {writing_style}
**Content Themes:** {content_themes}
**Interests:** {interests}

## Task

Create a soul document for this author persona. The document should define a distinctive, consistent voice that will be used to personalize all content this author writes.

Write the soul in second person ("You are...", "You tend to...") as it will be injected directly into LLM system prompts.

Focus on making the voice authentic to the author's expertise and background. The voice should feel natural for someone with their specific specialization — not generic.

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (2 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
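
How the two prompt families might fit together can be sketched as follows: parse the generation output as JSON first, and only when parsing fails render the repair prompt and ask a model to fix it. This is a minimal sketch under stated assumptions, not the production pipeline. `call_model` is a hypothetical callable standing in for whatever client sends a prompt and returns text, and the `soul` schema field is an invented example.

```python
import json
from typing import Callable

# User-prompt template copied from the JSON_REPAIR_USER text above.
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def build_repair_prompt(schema: dict, raw_text: str) -> str:
    """Fill the repair template with the schema and the broken output."""
    return JSON_REPAIR_USER.format(
        schema_json=json.dumps(schema, indent=2),
        raw_text=raw_text,
    )


def parse_or_repair(raw_text: str, schema: dict,
                    call_model: Callable[[str], str]) -> dict:
    """Return parsed JSON, calling the repair model only if parsing fails."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        repaired = call_model(build_repair_prompt(schema, raw_text))
        return json.loads(repaired)  # raises again if the repair also fails


# Usage with a stub in place of a real model call (hypothetical schema).
schema = {"type": "object", "properties": {"soul": {"type": "string"}}}
calls = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return '{"soul": "You are a careful empiricist."}'


result = parse_or_repair('{"soul": "You are a careful', schema, fake_model)
```

Trying a plain parse first keeps the repair call rare, which matches the low repair volume reported above (2 calls against 1720 generation calls in the window).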