Best LLMs for Translation

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 1334→1923 tokens

Models

Frontier on this task: DeepSeek V4 Flash at 9.86 / 10. Quality bar at 90%: 8.88.

Legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
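The bar and the greyed-row rule amount to a three-way classification. The following is a minimal sketch, assuming the published bar of 8.88 is used as-is and taking three rows from the table below. The function name `classify` and its labels are illustrative, not part of the leaderboard, and the sketch ignores the confidence tier.

```python
QUALITY_BAR = 8.88  # published bar: 90% of the frontier score


def classify(score: float, ci_low: float, bar: float = QUALITY_BAR) -> str:
    """Label a model by how firmly it clears the quality bar."""
    if ci_low >= bar:
        return "clears"   # even the CI floor is at or above the bar
    if score >= bar:
        return "greyed"   # point estimate clears, CI floor does not
    return "below"        # point estimate itself misses the bar


# (quality score, CI low) copied from the Models table
rows = {
    "DeepSeek V4 Flash": (9.86, 9.74),
    "Claude Sonnet 5": (8.89, 8.76),
    "GPT-5.5": (8.67, 8.48),
}
labels = {name: classify(score, low) for name, (score, low) in rows.items()}
print(labels)
```

Comparing the CI floor rather than the point estimate is the conservative reading: a model counts as clearing the bar only when its lower bound does.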

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	9.86 / 10	9.74	$0.37	best value
Gemini 3.5 Flash	9.31 / 10	9.00	$7.70	21x more expensive
Claude Sonnet 5	8.89 / 10	8.76	$10.88	30x more expensive
MiniMax M3	8.83 / 10	8.59	$1.76	4.8x more expensive
GPT-5.5	8.67 / 10	8.48	$45.37	124x more expensive
Qwen 3.6 Plus	8.46 / 10	8.22	$13.62	37x more expensive
GPT-5.4 Mini	8.14 / 10	7.75	$24.62	67x more expensive
Kimi K2.6	8.33 / 10	8.00	$14.00	38x more expensive
Qwen 3.6 Flash	8.60 / 10	8.28	$7.22	20x more expensive
Claude Haiku 4.5	7.84 / 10	7.49	$7.59	21x more expensive
Gemini 3.1 Pro Preview	8.45 / 10	8.27	$11.64	32x more expensive
DeepSeek V4 Pro	8.49 / 10	8.15	$2.23	6.1x more expensive
Claude Opus 4.8	8.84 / 10	8.61	$26.96	74x more expensive
Gemini 3.1 Flash Lite	8.01 / 10	7.70	$2.09	5.7x more expensive
GPT-5.6 Sol	8.84 / 10	8.64	$10.79	29x more expensive
Qwen 3.5 Flash	8.59 / 10	8.33	$2.32	6.3x more expensive
Claude Sonnet 4.6	8.24 / 10	7.97	$14.79	40x more expensive
Qwen 3.7 Plus	8.86 / 10	8.71	$4.94	14x more expensive
Grok 4.5	8.69 / 10	8.55	$15.94	44x more expensive
GPT-5.6 Luna	8.60 / 10	8.32	$2.32	6.3x more expensive
Meta Muse Spark 1.1	8.73 / 10	8.46	$10.48	29x more expensive
GPT-5.6 Terra	8.72 / 10	8.49	$5.08	14x more expensive
NVIDIA Nemotron-3 Super 120B	8.63 / 10	8.37	$1.47	4x more expensive
Tencent Hy3	8.53 / 10	8.31	$1.81	4.9x more expensive

Cost breakdown

Model	Provider	Quality	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash (★ best)	DeepSeek	9.86 / 10, CI [9.74, 9.99]	RANKED	$0.37	best value	batch
Gemini 3.5 Flash	Gemini	9.31 / 10, CI [9.00, 9.62]	MEDIUM	$7.70	21x	batch
Claude Sonnet 5	Anthropic	8.89 / 10, CI [8.76, 9.02]	RANKED	$10.88	30x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★). "16x" means you pay 16× that reference cost for no quality benefit above the bar.

Typical call shape for this task: 1334 input tokens → 1923 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs. It covers each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
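The overpay multiple is a plain cost ratio against the best-value model. The sketch below recomputes it from the displayed dollar figures. The function name `overpay_label` and the rounding rule (whole numbers from 10x upward, one decimal below that) are assumptions inferred from the labels in the table. The published multiples appear to be derived from unrounded costs, so a recomputed value can differ in the last digit (for example, $45.37 / $0.37 is about 122.6, while the table shows 124x).

```python
def overpay_label(cost: float, best_cost: float) -> str:
    """Format a cost as a multiple of the best-value cost, e.g. '21x'."""
    ratio = cost / best_cost
    if ratio >= 10:
        return f"{round(ratio)}x"      # large multiples: whole numbers
    text = f"{ratio:.1f}"              # small multiples: one decimal
    if text.endswith(".0"):
        text = text[:-2]               # show "4x" rather than "4.0x"
    return f"{text}x"


BEST = 0.37  # DeepSeek V4 Flash, $ per 1k runs

print(overpay_label(7.70, BEST))   # Gemini 3.5 Flash
print(overpay_label(1.76, BEST))   # MiniMax M3
print(overpay_label(1.47, BEST))   # NVIDIA Nemotron-3 Super 120B
```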

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

BATCH_TRANSLATE_TO_ENGLISH_SYSTEM + BATCH_TRANSLATE_TO_ENGLISH_USER (17331 calls in window)

System prompt

You are an expert multilingual translator. You will be given a numbered list of text items in a source language and must translate each into clear, grammatically correct, natural-sounding English (or the requested target language).

Rules:
- Preserve the original meaning, nuance, and tone of each item.
- Do NOT shorten, summarize, paraphrase, or merge items. Translate each item in full.
- Do NOT add information that is not present in the source.
- Preserve hashtags, @mentions, URLs, and inline tokens (e.g., [Link]) as-is. Translate only the natural language around them.
- If an item is already in the target language, return it unchanged.
- If an item is empty, return an empty string for that index.

Output format:
- Return a JSON object with a single `items` array containing exactly one entry per input item, in the same order.
- Each entry has only two fields: an integer `index` matching the input position (0-based) and a `translated_text` string.
- The length of `items` MUST equal the length of the input list. Do not skip, drop, or reorder items.
- Do NOT include any other top-level fields.

Your final output MUST be a single, valid JSON object that conforms to the provided Pydantic schema.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Translate the following items from {source_language} to {target_language}.

Input items (JSON array of {{"index": int, "text": str}}):
{items_json}

Requirements:
1. Return one translation per input item, preserving order and matching `index` values 0..N-1.
2. The output `items` array MUST have exactly the same length as the input list above.
3. Translate each `text` into natural, fluent {target_language}. Do not shorten, summarize, or merge items.
4. Preserve hashtags, @mentions, URLs, and inline tokens; translate only natural language.
5. If an input is already in {target_language}, return it unchanged at that index.

Generate a single, well-formed JSON object that strictly adheres to the Pydantic schema below. Schema:

The required JSON output schema is provided in the system prompt.
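The output contract in the two prompts above (one entry per input, 0-based `index` in order, only two fields per entry, inline tokens preserved) can be checked mechanically before a response is accepted. What follows is a minimal stdlib sketch, not the pipeline's actual validator: `validate_batch` and `TOKEN_RE` are hypothetical names, and the regular expression is only a rough approximation of "hashtags, @mentions, URLs, and inline tokens".

```python
import json
import re

# Rough pattern for tokens the prompt says must survive unchanged:
# URLs, #hashtags, @mentions, and bracketed inline tokens such as [Link].
TOKEN_RE = re.compile(r"https?://\S+|[#@]\w+|\[[^\]\s]+\]")


def validate_batch(inputs: list[dict], raw_output: str) -> list[str]:
    """Return contract violations; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or set(data) != {"items"}:
        return ["top level must be an object with only an `items` field"]
    items = data["items"]
    if not isinstance(items, list) or len(items) != len(inputs):
        return ["`items` length does not match the input list"]

    errors = []
    for pos, (src, out) in enumerate(zip(inputs, items)):
        if not isinstance(out, dict) or set(out) != {"index", "translated_text"}:
            errors.append(f"item {pos}: unexpected fields")
            continue
        if out["index"] != pos:
            errors.append(f"item {pos}: index is {out['index']!r}")
        text = out["translated_text"]
        if not isinstance(text, str):
            errors.append(f"item {pos}: translated_text is not a string")
            continue
        # Every protected token in the source must appear verbatim in the output.
        for token in TOKEN_RE.findall(src["text"]):
            if token not in text:
                errors.append(f"item {pos}: lost token {token}")
    return errors


inputs = [
    {"index": 0, "text": "Hola @ana, mira [Link] #viernes"},
    {"index": 1, "text": ""},
]
good = json.dumps({"items": [
    {"index": 0, "translated_text": "Hi @ana, look at [Link] #viernes"},
    {"index": 1, "translated_text": ""},
]})
print(validate_batch(inputs, good))  # []
```

Returning a list of violations rather than raising on the first one makes it easier to log which rule a model breaks most often.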

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (62 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
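The call counts suggest the repair prompt is a rare fallback (62 calls against 17331 translation calls). One plausible way to wire it in is to attempt a local parse first and call the repair model only when that fails. The sketch below assumes this flow, which the page does not describe. `parse_or_repair` is a hypothetical helper, and `repair` stands in for whatever function sends `JSON_REPAIR_USER` to a model and returns its reply.

```python
import json
from typing import Callable

FENCE = "`" * 3  # markdown code-fence marker


def parse_or_repair(raw_text: str, repair: Callable[[str], str]) -> dict:
    """Parse model output as JSON, calling `repair` only if local parsing fails."""
    candidate = raw_text.strip()
    # Cheap local fix first: drop a surrounding markdown code fence.
    if candidate.startswith(FENCE):
        lines = candidate.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        candidate = "\n".join(lines[1:])
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the repair model; its reply must itself be valid JSON.
        parsed = json.loads(repair(raw_text))
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


def fake_repair(raw: str) -> str:
    """Stub standing in for a real repair-model call."""
    return '{"items": []}'


print(parse_or_repair('{"items": [', fake_repair))  # truncated input goes to repair
```

Trying the local parse first keeps the extra model call off the common path, so it is paid for only on outputs that are actually malformed.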

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (29 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
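A judge reply in the shape described above can be validated and mapped onto the scoring bands before it is aggregated. This is a minimal sketch under stated assumptions: `parse_judgement` and `BANDS` are hypothetical names, and a score that falls between two listed bands (for example 8.95) is assigned to the lower band, a choice the rubric itself leaves open.

```python
import json

# Lower bound of each band from the judge's scoring scale, highest first.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def parse_judgement(raw: str) -> dict:
    """Validate a judge reply and attach the matching band label."""
    data = json.loads(raw)
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be within 0.0-10.0")
    mode = data.get("failure_mode")
    if mode is not None and not isinstance(mode, str):
        raise ValueError("failure_mode must be a string or null")
    rationale = data.get("rationale")
    if not isinstance(rationale, str):
        raise ValueError("rationale must be a string")
    # First band whose lower bound the score reaches.
    band = next(label for floor, label in BANDS if score >= floor)
    return {"score": float(score), "failure_mode": mode,
            "rationale": rationale, "band": band}


reply = '{"score": 8.9, "failure_mode": null, "rationale": "Minor gaps."}'
print(parse_judgement(reply)["band"])  # Good
```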