Best LLMs for Prompt Adaptation

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 1347→611 tokens

Models

Frontier on this task: Claude Sonnet 4.6 at 8.90 / 10. Quality bar at 90%: 8.01.
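
The bar quoted above is consistent with taking 90% of the frontier score. A minimal sketch of that arithmetic, assuming the bar is simply the frontier score times 0.90 rounded to two decimals; the function names are hypothetical and not taken from the leaderboard's own code:

```python
def quality_bar(frontier_score: float, fraction: float = 0.90) -> float:
    """Quality bar as a fraction of the frontier score (assumed rule)."""
    return round(frontier_score * fraction, 2)


def clears_bar(point_estimate: float, ci_low: float, bar: float) -> tuple[bool, bool]:
    """Return (point estimate clears the bar, CI low clears the bar)."""
    return point_estimate >= bar, ci_low >= bar


bar = quality_bar(8.90)
# DeepSeek V4 Flash from the table: 8.12 point estimate, 7.76 CI low
print(bar, clears_bar(8.12, 7.76, bar))
```

A model such as DeepSeek V4 Flash passes on its point estimate but not on its CI low, which is the situation the greyed rows describe.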

Chart legend: point-estimate floor (CI low); upper CI (less certain). Bars are sorted by blended cost, best-value model first; the six models whose point estimate falls below the bar are listed last. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
Tencent Hy3	8.75 / 10	8.65	$0.53	best value
DeepSeek V4 Flash	8.12 / 10	7.76	$1.43	2.7x more expensive
Qwen 3.5 Flash	8.11 / 10	7.92	$1.92	3.6x more expensive
GPT-5.6 Luna	8.48 / 10	8.29	$2.28	4.3x more expensive
DeepSeek V4 Pro	8.54 / 10	8.39	$2.38	4.5x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.24 / 10	7.91	$2.73	5.2x more expensive
GPT-5.6 Terra	8.32 / 10	8.03	$4.35	8.2x more expensive
Claude Sonnet 4.6	8.90 / 10	8.80	$7.12	13x more expensive
Gemini 3.5 Flash	8.71 / 10	8.51	$8.34	16x more expensive
GPT-5.6 Sol	8.45 / 10	8.25	$11.29	21x more expensive
Gemini 3.1 Pro Preview	8.68 / 10	8.59	$12.64	24x more expensive
Qwen 3.6 Plus	8.81 / 10	8.67	$13.69	26x more expensive
Meta Muse Spark 1.1	8.68 / 10	8.48	$14.07	27x more expensive
Kimi K2.6	8.76 / 10	8.52	$24.57	47x more expensive
GPT-5.5	8.32 / 10	8.13	$55.15	104x more expensive
Gemini 3.1 Flash Lite	5.95 / 10	5.65	$1.68	3.2x more expensive
Claude Haiku 4.5	7.82 / 10	7.49	$11.86	22x more expensive
GPT-5.4 Nano	7.28 / 10	6.86	$2.44	4.6x more expensive
GPT-5.4 Mini	6.79 / 10	6.35	$4.25	8.1x more expensive
NVIDIA Nemotron-3 Nano 30B-A3B	7.72 / 10	7.23	$0.82	1.5x more expensive
NVIDIA Nemotron-3 Super 120B	7.74 / 10	7.32	$0.75	1.4x more expensive

Cost breakdown

Model	Provider	Quality	Confidence	Cost / 1k runs	Overpay	Mode
Tencent Hy3 ★	OpenRouter	8.75 / 10 CI [8.65, 8.85]	RANKED	$0.53	best value	batch
DeepSeek V4 Flash	DeepSeek	8.12 / 10 CI [7.76, 8.49]	MEDIUM	$1.43	2.7x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	8.11 / 10 CI [7.92, 8.31]	RANKED	$1.92	3.6x	batch
GPT-5.6 Luna	OpenAI	8.48 / 10 CI [8.29, 8.68]	RANKED	$2.28	4.3x	batch
DeepSeek V4 Pro	DeepSeek	8.54 / 10 CI [8.39, 8.69]	RANKED	$2.38	4.5x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.24 / 10 CI [7.91, 8.57]	MEDIUM	$2.73	5.2x	batch
GPT-5.6 Terra	OpenAI	8.32 / 10 CI [8.03, 8.61]	HIGH	$4.35	8.2x	batch
Claude Sonnet 4.6 (best)	Anthropic	8.90 / 10 CI [8.80, 9.00]	RANKED	$7.12	13x	batch
Gemini 3.5 Flash	Gemini	8.71 / 10 CI [8.51, 8.91]	RANKED	$8.34	16x	batch
GPT-5.6 Sol	OpenAI	8.45 / 10 CI [8.25, 8.66]	HIGH	$11.29	21x	batch
Gemini 3.1 Pro Preview	Gemini	8.68 / 10 CI [8.59, 8.77]	RANKED	$12.64	24x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.81 / 10 CI [8.67, 8.94]	RANKED	$13.69	26x	batch
Meta Muse Spark 1.1	Meta	8.68 / 10 CI [8.48, 8.89]	HIGH	$14.07	27x	batch
Kimi K2.6	Moonshot AI	8.76 / 10 CI [8.52, 8.99]	HIGH	$24.57	47x	batch
GPT-5.5	OpenAI	8.32 / 10 CI [8.13, 8.51]	RANKED	$55.15	104x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), which serves as the reference for a good-enough result. "16x" means you pay 16× that reference cost for no quality benefit above the bar.

Typical call shape for this task: 1347 input tokens → 611 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
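
The overpay multiple can be approximated from the displayed costs. The sketch below assumes the multiple is the model's cost divided by the ★ model's cost, shown with one decimal below 10x and as a whole number above; that display rule is inferred from the table, and the function name is hypothetical:

```python
def overpay(cost_per_1k: float, reference_cost_per_1k: float) -> str:
    """Format cost / reference as an overpay multiple (inferred display rule)."""
    ratio = cost_per_1k / reference_cost_per_1k
    return f"{ratio:.1f}x" if ratio < 10 else f"{ratio:.0f}x"


REFERENCE = 0.53  # Tencent Hy3, the starred best-value model

print(overpay(1.43, REFERENCE))  # DeepSeek V4 Flash
print(overpay(7.12, REFERENCE))  # Claude Sonnet 4.6
print(overpay(8.34, REFERENCE))  # Gemini 3.5 Flash
```

Recomputing from the rounded dollar figures does not reproduce every published multiple. Kimi K2.6 comes out near 46x rather than 47x, which suggests the page divides unrounded costs.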

Prompt templates

This is a pooled capability — 3 prompt families share it. The pair shown first is the most frequently used in production.

PROMPT_ADAPTATION_SYSTEM + PROMPT_ADAPTATION_USER (17901 calls in window)

System prompt

You are an expert prompt engineer. Your job is to renormalize an existing prompt template so it performs optimally for one specific LLM model on one specific task, WITHOUT changing what the prompt asks for or the variables it uses.

You will be given:
- the target model's name,
- the task (capability) the prompt serves, with its description and the quality criteria its outputs are judged against,
- the original prompt template,
- the exact list of placeholder tokens the template uses.

Adaptation principles:

1. Model fit. Re-express instructions to match how the target model best follows direction — explicit and tightly structured for some models, conversational and detailed for others, terse for smaller-context models. Tighten wording, remove redundancy, and add light structure (section headers, numbered steps) only where it helps that model produce correct, well-formed output.

2. Task fit. Use the capability description and the quality criteria to sharpen the prompt's emphasis on what actually matters for this task. Do not invent new requirements, constraints, or steps that the original prompt did not express.

3. Preserve intent. Keep the original prompt's purpose, requirements, and business logic intact. You are re-expressing the same prompt for a specific model — not authoring a different prompt.

Hard rules — violating any of these makes the adaptation unusable and it will be discarded:

- PLACEHOLDERS ARE SACRED. The adapted prompt MUST contain the exact same set of placeholder tokens as the original: the same names, with no additions, no removals, and no renames. Reproduce each placeholder verbatim, wrapped in curly braces, exactly as it appears in the original. The runtime substitutes those tokens by name, so changing the set breaks substitution.

- NO OUTPUT-FORMAT INSTRUCTIONS. Do not add JSON schemas, field lists, "return an object with…" directions, or any output-format or data-structure specification. Structured output is handled separately by the runtime. If the original prompt references a schema or output-format placeholder, keep that placeholder exactly as-is and add nothing around it.

- ADAPT PROSE ONLY. Change only the instructional text. Never alter, annotate, or reformat the placeholder tokens themselves.

Return the adapted prompt plus a brief reasoning for the changes you made, in the required structured form.

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Adapt the following prompt template for the target model and task.

Target model: {model_name}

Task (capability): {capability_slug}
Task description:
{capability_description}

Output quality criteria (what this task's results are judged on):
{evaluation_criteria}

Placeholder tokens that MUST appear unchanged in your adapted prompt — same set, verbatim, in curly braces:
{required_placeholders}

Original prompt template:
<<<ORIGINAL_PROMPT_START>>>
{original_prompt_content}
<<<ORIGINAL_PROMPT_END>>>

Instructions:
1. Rewrite the instructional prose to suit the target model and to sharpen it for this task and its quality criteria.
2. Keep the original purpose, requirements, and constraints. Do not add new requirements.
3. Reproduce every placeholder token from the list above exactly — nothing renamed, added, or removed.
4. Do NOT add any output-format, JSON, or schema instructions. If the original references an output or schema placeholder, leave it exactly as-is.

Provide your response in this structure:
The required JSON output schema is provided in the system prompt.
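
The placeholder rule in the PROMPT_ADAPTATION pair lends itself to a mechanical check before an adaptation is accepted. A minimal sketch, assuming placeholders are identifier-like names in single curly braces; the regex, function names, and sample templates are illustrative and not the production validator:

```python
import re

# Assumption: a placeholder is {name} where name is a Python-style identifier.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def placeholders(template: str) -> set[str]:
    """Collect the set of placeholder names used in a template."""
    return set(PLACEHOLDER.findall(template))


def check_adaptation(original: str, adapted: str) -> tuple[set[str], set[str]]:
    """Return (missing, added) placeholder names; both empty means the sets match."""
    before, after = placeholders(original), placeholders(adapted)
    return before - after, after - before


original = "Summarize {document} for {audience}."
adapted = "For {audience}: summarize {document} in a {tone} voice."
missing, added = check_adaptation(original, adapted)
print(missing, added)  # the adapted prompt introduced {tone}
```

Comparing sets rather than counts matches the rule as stated (same names, no additions, removals, or renames); it deliberately ignores how many times each placeholder appears.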

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (65 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
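
A repair call is only worth making when the raw text does not already parse. The sketch below shows one way a caller might try a cheap local parse first, assuming that a stray markdown fence around otherwise valid JSON is a common failure; the helper name is hypothetical and the page does not say whether the pipeline works this way:

```python
import json

FENCE = "`" * 3  # markdown code-fence marker


def try_parse(raw_text: str):
    """Return the parsed JSON object, or None if a repair call is needed."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    # The repair prompt asks for a single JSON object, so reject arrays and scalars.
    return obj if isinstance(obj, dict) else None


fenced = FENCE + 'json\n{"score": 8.5}\n' + FENCE
print(try_parse(fenced))        # parses after the fence is stripped
print(try_parse('{"score": '))  # truncated: would go to the repair prompt
```

This only checks syntax; validating the result against the JSON Schema is a separate step not shown here.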

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (5 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
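
On the consuming side, the judge's JSON can be validated and mapped onto the scoring bands from the system prompt. A minimal sketch, assuming each band starts at its listed lower bound so that a score such as 8.95 falls under "Good"; the function names are hypothetical:

```python
# Lower bound of each band, taken from the scoring scale in the system prompt.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def band(score: float) -> str:
    """Map a 0.0-10.0 score to its band label."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise ValueError(f"unreachable for in-range scores: {score}")


def validate_judgement(obj: dict) -> dict:
    """Check the three fields the judge prompt asks for and attach the band."""
    score = float(obj["score"])
    failure_mode = obj.get("failure_mode")
    if failure_mode is not None and not isinstance(failure_mode, str):
        raise ValueError("failure_mode must be a string or null")
    rationale = obj.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return {"score": score, "band": band(score),
            "failure_mode": failure_mode, "rationale": rationale}


print(validate_judgement(
    {"score": 8.9, "failure_mode": None, "rationale": "Minor gaps only."}
))
```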