Best LLMs for Translation — DTP Benchmark
Configuration for Translation.
Models
Frontier on this task: GPT-5.5 at 8.21 / 10. Quality bar, set at 95% of the frontier score: 7.79.
Chart legend: point-estimate floor (CI low) · upper CI (less certain).
Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
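The sorting and greying rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Row` type and `rank` function are hypothetical names, the figures are copied from the cost breakdown table in this section, and the MEDIUM+ confidence-tier filter is left out because the page does not define the tiers.

```python
from dataclasses import dataclass

QUALITY_BAR = 7.79  # quality bar quoted on this page


@dataclass(frozen=True)
class Row:
    model: str
    quality: float  # point estimate, out of 10
    ci_low: float   # lower bound of the confidence interval
    cost: float     # blended cost per call, USD


# Figures copied from the cost breakdown table, deliberately unsorted.
ROWS = [
    Row("GPT-5.5", 8.21, 8.00, 0.065455),
    Row("Kimi K2.6", 8.04, 7.85, 0.010810),
    Row("Qwen 3.5 Flash", 7.94, 7.76, 0.001108),
    Row("Gemini 3.1 Pro Preview", 8.02, 7.85, 0.026182),
    Row("Gemini 3 Flash Preview", 7.81, 7.63, 0.006546),
]


def rank(rows, bar=QUALITY_BAR):
    """Return (model, greyed) pairs, cheapest qualifier first.

    A row qualifies when its point estimate clears the bar. It is flagged
    as greyed when its CI low falls below the bar (assumption: the tier
    check mentioned in the legend is ignored here).
    """
    qualifiers = [r for r in rows if r.quality >= bar]
    ordered = sorted(qualifiers, key=lambda r: r.cost)
    return [(r.model, r.ci_low < bar) for r in ordered]


for model, greyed in rank(ROWS):
    status = "greyed (CI low below bar)" if greyed else "clears bar at CI low"
    print(f"{model:<24} {status}")
```

Under these assumptions the two cheapest rows, Qwen 3.5 Flash and Gemini 3 Flash Preview, are flagged, since their CI lows (7.76 and 7.63) sit below 7.79.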
Cost breakdown
| Model | Provider | Quality (CI) | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|---|
| Qwen 3.5 Flash | Alibaba Cloud (DashScope) | 7.94 / 10 CI [7.76, 8.12] | n=100 · ranked | $0.001108 | 98% cheaper | sync |
| Gemini 3 Flash Preview | Gemini | 7.81 / 10 CI [7.63, 7.99] | n=92 · ranked | $0.006546 | 90% cheaper | batch |
| Kimi K2.6 | Moonshot AI | 8.04 / 10 CI [7.85, 8.22] | n=100 · ranked | $0.010810 | 83% cheaper | batch |
| Gemini 3.1 Pro Preview | Gemini | 8.02 / 10 CI [7.85, 8.18] | n=92 · ranked | $0.026182 | 60% cheaper | batch |
| GPT-5.5 (best) | OpenAI | 8.21 / 10 CI [8.00, 8.41] | n=90 · high | $0.065455 | (anchor) | batch |
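Typical call shape for this task: 1984 input tokens → 4033 output tokens, tracked as an exponential moving average (EMA) of production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals, where in and out are the input and output token counts and in_price and out_price are the corresponding per-token prices.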
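As a rough sketch of the arithmetic, the snippet below applies the blended-cost formula to the typical call shape and derives the "Savings vs best" percentages from the listed costs. The function names are hypothetical, the prices of $1.00 and $2.00 per million tokens are made-up placeholders (the page does not list per-token prices), and rounding savings to the nearest whole percent is an assumption.

```python
def blended_cost(tokens_in, tokens_out, in_price_per_mtok, out_price_per_mtok):
    """Blended cost per call in USD, rounded to 6 decimals.

    Prices are expressed per million tokens (an assumption; the page
    does not state the unit of in_price / out_price).
    """
    total = tokens_in * in_price_per_mtok + tokens_out * out_price_per_mtok
    return round(total / 1_000_000, 6)


def savings_vs_best(cost, anchor_cost):
    """Whole-percent saving relative to the anchor (best) model's cost."""
    return round(100 * (1 - cost / anchor_cost))


# Typical call shape from this page, with placeholder prices.
example = blended_cost(1984, 4033, in_price_per_mtok=1.00, out_price_per_mtok=2.00)
print(f"example blended cost: ${example:.6f}")

# Blended costs copied from the cost breakdown table; GPT-5.5 is the anchor.
ANCHOR_COST = 0.065455
TABLE_COSTS = {
    "Qwen 3.5 Flash": 0.001108,
    "Gemini 3 Flash Preview": 0.006546,
    "Kimi K2.6": 0.010810,
    "Gemini 3.1 Pro Preview": 0.026182,
}
for model, cost in TABLE_COSTS.items():
    print(f"{model:<24} {savings_vs_best(cost, ANCHOR_COST)}% cheaper")
```

With whole-percent rounding, the computed savings match the 98%, 90%, 83%, and 60% figures shown in the table.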