Best LLMs for Infrastructure & Utility — DTP Benchmark
Tasks that test mechanical competence at format conversion, metadata manipulation, prompt rewriting, and translation; minimal domain expertise is required.
Task-by-task breakdown
claim_refinement (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Gemini 3 Flash Preview | 7.73 | $0.001599 | 23% |
| 2 | Qwen 3.6 Plus (best) | 8.08 | $0.002079 | — |
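
The percentages in the `vs best` column appear to be the per-call cost saving relative to the model marked best (the highest-quality entry for the task), with negative values meaning the model costs more than the best one. Rows are ordered by ascending cost rather than by quality. The formula is inferred from the published numbers, not documented by the benchmark, and the function name `cost_saving_vs_best` is hypothetical. A minimal sketch that reproduces three of the published values:

```python
def cost_saving_vs_best(cost: float, best_cost: float) -> int:
    """Percent saved per call relative to the best-quality model.

    Assumed formula (inferred from the tables): (1 - cost / best_cost) * 100,
    rounded to a whole percent. A negative result means the model is more
    expensive than the best-quality one.
    """
    return round((1 - cost / best_cost) * 100)


# (task, model, cost per call, cost per call of the task's best model, value shown in the table)
rows = [
    ("claim_refinement", "Gemini 3 Flash Preview", 0.001599, 0.002079, 23),
    ("LLM Prompt Adaptation", "Claude Opus 4.7", 0.073908, 0.044344, -67),
    ("query_validation", "Gemini 3 Flash Preview", 0.001030, 0.000169, -509),
]

for task, model, cost, best_cost, published in rows:
    computed = cost_saving_vs_best(cost, best_cost)
    print(f"{task}: {model} -> {computed}% (table shows {published}%)")
```

Read this way, a high percentage alone does not make a model the better choice; it has to be weighed against the gap in the Quality column.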
image_prompt_generation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Qwen 3.5 Flash | 8.30 | $0.000507 | 98% |
| 2 | DeepSeek V4 Flash | 8.35 | $0.000794 | 97% |
| 3 | Qwen 3.6 Plus | 8.34 | $0.004030 | 85% |
| 4 | Kimi K2.6 | 8.27 | $0.008922 | 67% |
| 5 | DeepSeek V4 Pro | 8.43 | $0.009871 | 63% |
LLM Prompt Adaptation
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Qwen 3.6 Plus | 8.89 | $0.011387 | 74% |
| 2 | DeepSeek V4 Pro | 8.73 | $0.022865 | 48% |
| 3 | Kimi K2.6 | 8.73 | $0.023979 | 46% |
| 4 | Claude Sonnet 4.6 (best) | 9.10 | $0.044344 | — |
| 5 | Claude Opus 4.7 | 8.83 | $0.073908 | -67% |
markdown_newline_repair (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
metadata_paragraph_improvement (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Qwen 3.6 Plus (best) | 8.58 | $0.000814 | — |
onboarding_chapter_prompt_generation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Kimi K2.6 | 8.75 | $0.116991 | 73% |
| 2 | Claude Opus 4.7 | 8.98 | $0.363502 | 16% |
| 3 | GPT-5.5 (best) | 9.18 | $0.433970 | — |
query_generation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Kimi K2.6 | 8.59 | $0.006990 | 69% |
| 2 | Gemini 3.1 Pro Preview | 8.67 | $0.009015 | 60% |
| 3 | GPT-5.5 (best) | 8.99 | $0.022538 | — |
query_validation (autogenerated)
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Qwen 3.5 Flash (best) | 8.63 | $0.000169 | — |
| 2 | Gemini 3 Flash Preview | 8.29 | $0.001030 | -509% |
report_image_generation (autogenerated)
No model has reached MEDIUM confidence yet — accumulating evidence.
Translation
Configuration for Translation.
| Rank | Model | Quality | Cost / call | Cost saving vs best |
|---|---|---|---|---|
| 1 | Qwen 3.5 Flash | 7.94 | $0.000534 | 98% |
| 2 | Gemini 3 Flash Preview | 7.81 | $0.003179 | 90% |
| 3 | Kimi K2.6 | 8.04 | $0.008831 | 72% |
| 4 | Gemini 3.1 Pro Preview | 8.02 | $0.012715 | 60% |
| 5 | GPT-5.5 (best) | 8.21 | $0.031788 | — |