Best LLMs for Long-form Content Generation — DTP Benchmark
Sustained compositional skill, voice consistency, coherent extended prose.
Task-by-task breakdown
author_soul_generation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saved vs best |
|---|---|---|---|---|
| 1 | DeepSeek V4 Flash | 9.26 | $0.001420 | 99% |
| 2 | DeepSeek V4 Pro | 9.16 | $0.017647 | 85% |
| 3 | Haiku 4.5 | 9.14 | $0.023960 | 80% |
| 4 | Claude Sonnet 4.6 | 9.18 | $0.071880 | 40% |
| 5 | Claude Opus 4.7 (best) | 9.53 | $0.119800 | — |
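
The "vs best" figures appear to be the percentage of per-call cost saved relative to the model marked best, which is the model with the highest quality score in each table. This reading is inferred from the published numbers and is not stated by the benchmark itself. A minimal Python sketch under that assumption, using the hypothetical names `Row` and `savings_vs_best`, reproduces the column for the table above:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical helper type, not part of the benchmark)."""
    model: str
    quality: float
    cost_per_call: float  # USD per call


# Values copied from the author_soul_generation table.
ROWS = [
    Row("DeepSeek V4 Flash", 9.26, 0.001420),
    Row("DeepSeek V4 Pro", 9.16, 0.017647),
    Row("Haiku 4.5", 9.14, 0.023960),
    Row("Claude Sonnet 4.6", 9.18, 0.071880),
    Row("Claude Opus 4.7", 9.53, 0.119800),
]


def savings_vs_best(rows: List[Row]) -> Dict[str, Optional[int]]:
    """Return the assumed 'vs best' percentage for each model.

    Assumption: 'best' is the row with the highest quality score, and the
    column is 100 * (1 - cost / best_cost), rounded to a whole percent.
    The best row itself maps to None (shown as a dash in the table).
    """
    best = max(rows, key=lambda r: r.quality)
    result: Dict[str, Optional[int]] = {}
    for row in rows:
        if row is best:
            result[row.model] = None
        else:
            saved = 1 - row.cost_per_call / best.cost_per_call
            result[row.model] = round(100 * saved)
    return result


if __name__ == "__main__":
    for model, pct in savings_vs_best(ROWS).items():
        print(model, "-" if pct is None else f"{pct}%")
```

Under this reading, a negative value means the model costs more per call than the best-quality model. That is why the tables whose best model is not the most expensive one show entries such as -67% or -547%.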
Claim-Referenced Analyst Writing (pooled)
Pooled TT for analyst-prose synthesis tasks that must preserve [N] workflow-global claim references: cluster_claim_synthesis, chapter_consolidation, topic_report_generation. Wired to the …
| Rank | Model | Quality | Cost / call | Cost saved vs best |
|---|---|---|---|---|
| 1 | Qwen 3.6 Plus | 8.53 | $0.012542 | 76% |
| 2 | Kimi K2.6 | 8.77 | $0.030644 | 42% |
| 3 | DeepSeek V4 Pro | 8.73 | $0.042515 | 19% |
| 4 | Claude Sonnet 4.6 (best) | 8.98 | $0.052576 | — |
| 5 | Claude Opus 4.7 | 8.82 | $0.087628 | -67% |
onboarding_chapter_generation (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
section_generation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saved vs best |
|---|---|---|---|---|
| 1 | Qwen 3.6 Plus (best) | 9.09 | $0.004831 | — |
| 2 | DeepSeek V4 Pro | 8.81 | $0.009403 | -95% |
| 3 | Kimi K2.6 | 9.05 | $0.010100 | -109% |
| 4 | Claude Sonnet 4.6 | 8.79 | $0.018748 | -288% |
| 5 | Claude Opus 4.7 | 9.06 | $0.031248 | -547% |
Substack Newsletter (pooled)
Pooled TT for Substack opener and summary newsletter generation. Both tasks share the same role and voice: the opener is a shorter announcement, and the summary is a longer recap with per-article links.
| Rank | Model | Quality | Cost / call | Cost saved vs best |
|---|---|---|---|---|
| 1 | Qwen 3.6 Plus | 9.09 | $0.004258 | 92% |
| 2 | Kimi K2.6 | 9.21 | $0.008882 | 84% |
| 3 | Claude Opus 4.7 (best) | 9.51 | $0.055020 | — |
theme_generation (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.