Best LLMs for Relevance, Classification & Matching — DTP Benchmark
Semantic similarity judgment: does this thing belong in that bucket / match that target?
Task-by-task breakdown
at_content_domain_suggest (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
Author Matching
Matches content to fictional authors or creates new author personas
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | DeepSeek V4 Flash | 8.63 | $0.001981 | 92% |
| 2 | Kimi K2.6 | 8.81 | $0.013565 | 45% |
| 3 | Gemini 3.1 Pro Preview | 8.59 | $0.014381 | 42% |
| 4 | DeepSeek V4 Pro (best) | 8.94 | $0.024626 | — |
| 5 | GPT-5.5 | 8.57 | $0.035952 | -46% |
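The page does not define the "vs best" column. Judging from the numbers, it appears to be the cost saving relative to the model marked best (the highest Quality score), with negative values meaning the model costs more than the best one. The sketch below works under that assumption and reproduces the Author Matching figures; the function name `saving_vs_best` and the formula are inferred from the table, not documented by the benchmark.

```python
def saving_vs_best(cost: float, best_cost: float) -> int:
    """Percent cost saving relative to the best-quality model (assumed formula)."""
    return round((1 - cost / best_cost) * 100)


# (model, quality, cost per call in USD), copied from the Author Matching table.
author_matching = [
    ("DeepSeek V4 Flash", 8.63, 0.001981),
    ("Kimi K2.6", 8.81, 0.013565),
    ("Gemini 3.1 Pro Preview", 8.59, 0.014381),
    ("DeepSeek V4 Pro", 8.94, 0.024626),
    ("GPT-5.5", 8.57, 0.035952),
]

# The "best" row is assumed to be the one with the highest quality score.
best = max(author_matching, key=lambda row: row[1])

savings = {
    name: saving_vs_best(cost, best[2])
    for name, _, cost in author_matching
    if name != best[0]
}

for name, pct in savings.items():
    print(f"{name}: {pct}%")
```

Under this reading, the same formula also matches the other tables on the page, for example -137% for Gemini 3.1 Pro Preview on author_living_check.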
author_living_check (autogenerated)
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | DeepSeek V4 Flash | 8.87 | $0.000688 | 92% |
| 2 | Kimi K2.6 | 8.88 | $0.007413 | 13% |
| 3 | DeepSeek V4 Pro (best) | 9.31 | $0.008554 | — |
| 4 | Gemini 3.1 Pro Preview | 8.87 | $0.020280 | -137% |
| 5 | GPT-5.5 | 8.95 | $0.050700 | -493% |
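The tables list models in order of cost and mark the highest-quality one as best, so one way to read them is to pick the cheapest model whose quality is within some tolerance of the best. The sketch below illustrates that rule on the author_living_check rows. The rule, the `Row` type, the `cheapest_within` helper, and the tolerance values are all hypothetical and are not part of the benchmark.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    quality: float
    cost: float  # USD per call


# Rows copied from the author_living_check table.
author_living_check = [
    Row("DeepSeek V4 Flash", 8.87, 0.000688),
    Row("Kimi K2.6", 8.88, 0.007413),
    Row("DeepSeek V4 Pro", 9.31, 0.008554),
    Row("Gemini 3.1 Pro Preview", 8.87, 0.020280),
    Row("GPT-5.5", 8.95, 0.050700),
]


def cheapest_within(rows: list[Row], tolerance: float) -> Row:
    """Return the cheapest row whose quality is within `tolerance` of the best."""
    best_quality = max(r.quality for r in rows)
    eligible = [r for r in rows if r.quality >= best_quality - tolerance]
    return min(eligible, key=lambda r: r.cost)


print(cheapest_within(author_living_check, tolerance=0.5).model)
print(cheapest_within(author_living_check, tolerance=0.4).model)
```

A looser tolerance admits the cheap models near the top of the table, while a tight one falls back to the best-quality model, so the tolerance is the main knob to set per task.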
Language Detection
Configuration for Language Detection.
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | GPT-5.4 nano | 9.98 | $0.000046 | 59% |
| 2 | DeepSeek V4 Flash | 9.94 | $0.000050 | 55% |
| 3 | Gemini 3.1 Flash Lite | 9.99 | $0.000056 | 50% |
| 4 | Gemini 3 Flash Preview (best) | 10.01 | $0.000112 | — |
| 5 | MiniMax M2.5 | 10.00 | $0.000121 | -8% |
Relevance Scoring (POST)
Scores RetrievedContent against synthesis capability description (stage 40 relevance_analysis). Split from pooled relevance_scoring on 2026-05-17 to remove inter-family σ inflation. …
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview (best) | 5.69 | $0.004000 | — |
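The description above mentions splitting the pooled task to remove "inter-family σ inflation". One plausible reading, offered here as an assumption, is that pooling scores from task families with different typical score levels inflates the standard deviation well beyond the spread seen within any single family. The sketch below shows that effect with synthetic numbers; the two families and their scores are invented for illustration and are not benchmark data.

```python
from statistics import pstdev

# Synthetic quality scores for two hypothetical task families.
# These values are made up purely to illustrate the effect.
family_a = [5.5, 5.7, 5.9]  # a family that tends to score low
family_b = [8.3, 8.5, 8.7]  # a family that tends to score high

within_a = pstdev(family_a)
within_b = pstdev(family_b)
pooled = pstdev(family_a + family_b)

print(f"sigma within family A: {within_a:.3f}")
print(f"sigma within family B: {within_b:.3f}")
print(f"sigma of pooled scores: {pooled:.3f}")
```

Each family is tight on its own, but the pooled σ is dominated by the gap between the family means, which would make per-model comparisons look noisier than they are.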
Relevance Scoring (Topic Report)
Scores TOPIC_REPORT PartialSyntheses against analysis template (stage 132) and report chapters (stage 134). Split from pooled relevance_scoring on 2026-05-17. …
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | Qwen 3.5 Flash (best) | 8.48 | $0.000160 | — |
| 2 | MiniMax M2.5 | 8.44 | $0.000900 | -462% |
Relevance Scoring (X Post)
Scores batched X-com posts against synthesis capability (x_post_relevance stage). Split from pooled relevance_scoring on 2026-05-17. GENERIC_RELEVANCE_SCORE_{SYSTEM,USER}_PROMPT, batched input (≥20k …
No model has reached MEDIUM confidence yet — accumulating evidence.
subreddit_selection (autogenerated)
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | GPT-5.5 (best) | 5.69 | $0.021258 | — |
subreddit_vetting (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
topic_client_matching (autogenerated)
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | DeepSeek V4 Flash (best) | 8.24 | $0.003086 | — |
| 2 | DeepSeek V4 Pro | 8.14 | $0.038350 | -1143% |
| 3 | Gemini 3.1 Pro Preview | 7.88 | $0.043936 | -1324% |
| 4 | GPT-5.5 | 8.08 | $0.109840 | -3459% |
vetted_site_selection (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
x_post_selection (autogenerated)
| Rank | Model | Quality | Cost / call | vs best |
|---|---|---|---|---|
| 1 | Qwen 3.5 Flash | 8.38 | $0.000103 | 99% |
| 2 | DeepSeek V4 Flash | 8.43 | $0.000430 | 97% |
| 3 | Gemini 3.1 Flash Lite | 8.22 | $0.000820 | 95% |
| 4 | Qwen 3.6 Plus | 8.37 | $0.001066 | 94% |
| 5 | Gemini 3 Flash Preview | 8.32 | $0.001640 | 90% |