Best LLMs for Image Prompt Generation

Category: Infrastructure & Utility · Rail: absolute · Typical I/O: 2222→742 tokens

Models

Frontier on this task: GPT-5.6 Terra at 9.19 / 10. Quality bar (90% of the frontier score): 8.27.
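
The bar follows from the frontier score by simple arithmetic. A minimal sketch, assuming the page rounds to two decimals; the function and parameter names are illustrative and do not come from the page:

```python
def quality_bar(frontier_score: float, fraction: float = 0.90) -> float:
    """Return the quality bar as a fraction of the frontier score.

    Rounding to two decimals is an assumption made to match the displayed value.
    """
    return round(frontier_score * fraction, 2)


bar = quality_bar(9.19)
print(bar)  # 8.27
```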

Chart legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, best-value model first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.

Model	Quality score	CI low	Cost / 1k runs	vs best value
DeepSeek V4 Flash	8.55 / 10	8.38	$0.36	best value
Tencent Hy3	8.33 / 10	8.12	$1.04	2.9x more expensive
MiniMax M3	8.73 / 10	8.28	$1.80	5x more expensive
Qwen 3.5 Flash	8.48 / 10	8.38	$1.91	5.3x more expensive
GPT-5.6 Luna	9.02 / 10	8.82	$2.35	6.5x more expensive
GPT-5.4 Mini	8.36 / 10	8.21	$2.60	7.2x more expensive
DeepSeek V4 Pro	8.65 / 10	8.55	$2.74	7.6x more expensive
NVIDIA Nemotron-3 Ultra 550B	8.53 / 10	8.25	$4.00	11x more expensive
Qwen 3.7 Plus	8.39 / 10	8.33	$4.80	13x more expensive
GPT-5.6 Terra	9.19 / 10	8.97	$5.00	14x more expensive
Qwen 3.6 Flash	8.46 / 10	8.39	$5.89	16x more expensive
Qwen 3.6 Plus	8.61 / 10	8.51	$6.51	18x more expensive
Grok 4.5	8.50 / 10	8.43	$6.98	19x more expensive
Claude Sonnet 5	8.62 / 10	8.56	$7.71	21x more expensive
Meta Muse Spark 1.1	8.89 / 10	8.63	$10.74	30x more expensive
Gemini 3.5 Flash	8.85 / 10	8.78	$10.79	30x more expensive
Kimi K2.6	8.50 / 10	8.39	$11.95	33x more expensive
GPT-5.6 Sol	9.08 / 10	8.84	$12.29	34x more expensive
Claude Sonnet 4.6	8.31 / 10	8.21	$18.33	51x more expensive
GPT-5.5	8.45 / 10	8.28	$32.64	90x more expensive
GPT-5.4 Nano	7.31 / 10	7.08	$1.57	4.4x more expensive
Gemini 3.1 Flash Lite	7.58 / 10	7.42	$1.24	3.4x more expensive
Gemini 3.1 Pro Preview	7.83 / 10	7.69	$8.21	23x more expensive
Claude Haiku 4.5	7.76 / 10	7.60	$5.14	14x more expensive
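
The row states described in the legend can be sketched as below. This is an illustration under two assumptions: "clears the bar" is read as CI low at or above 8.27, and the MEDIUM+ confidence condition is ignored because this table has no confidence column. `Row`, `status`, and `best_value` are hypothetical names, and only a few rows from the table are included.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float        # point estimate, out of 10
    ci_low: float       # lower bound of the confidence interval
    cost_per_1k: float  # observed $ per 1,000 task runs


BAR = 8.27  # quality bar quoted on the page

ROWS = [
    Row("DeepSeek V4 Flash", 8.55, 8.38, 0.36),
    Row("Tencent Hy3", 8.33, 8.12, 1.04),
    Row("GPT-5.6 Terra", 9.19, 8.97, 5.00),
    Row("GPT-5.4 Nano", 7.31, 7.08, 1.57),
]


def status(row: Row, bar: float = BAR) -> str:
    """Classify a row relative to the quality bar."""
    if row.ci_low >= bar:
        return "clears"  # even the CI floor is above the bar
    if row.score >= bar:
        return "greyed"  # point estimate clears, CI low does not
    return "below"       # point estimate misses the bar


def best_value(rows: list[Row], bar: float = BAR) -> Row:
    """Cheapest row whose CI low clears the bar."""
    return min((r for r in rows if status(r, bar) == "clears"),
               key=lambda r: r.cost_per_1k)


for row in ROWS:
    print(f"{row.model}: {status(row)}")
print("best value:", best_value(ROWS).model)
```

With these rows, Tencent Hy3 comes out as greyed (8.33 clears the bar, 8.12 does not) and DeepSeek V4 Flash is the cheapest model that clears it.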

Cost breakdown

Model	Provider	Quality	CI	Confidence	Cost / 1k runs	Overpay	Mode
DeepSeek V4 Flash ★	DeepSeek	8.55 / 10	[8.38, 8.71]	RANKED	$0.36	best value	batch
Tencent Hy3	OpenRouter	8.33 / 10	[8.12, 8.54]	HIGH	$1.04	2.9x	batch
MiniMax M3	MiniMax	8.73 / 10	[8.28, 9.17]	MEDIUM	$1.80	5x	batch
Qwen 3.5 Flash	Alibaba Cloud (DashScope)	8.48 / 10	[8.38, 8.59]	RANKED	$1.91	5.3x	batch
GPT-5.6 Luna	OpenAI	9.02 / 10	[8.82, 9.22]	RANKED	$2.35	6.5x	batch
GPT-5.4 Mini	OpenAI	8.36 / 10	[8.21, 8.50]	RANKED	$2.60	7.2x	batch
DeepSeek V4 Pro	DeepSeek	8.65 / 10	[8.55, 8.76]	RANKED	$2.74	7.6x	batch
NVIDIA Nemotron-3 Ultra 550B	OpenRouter	8.53 / 10	[8.25, 8.80]	HIGH	$4.00	11x	batch
Qwen 3.7 Plus	Alibaba Cloud (DashScope)	8.39 / 10	[8.33, 8.46]	RANKED	$4.80	13x	batch
GPT-5.6 Terra (best)	OpenAI	9.19 / 10	[8.97, 9.41]	HIGH	$5.00	14x	batch
Qwen 3.6 Flash	Alibaba Cloud (DashScope)	8.46 / 10	[8.39, 8.53]	RANKED	$5.89	16x	batch
Qwen 3.6 Plus	Alibaba Cloud (DashScope)	8.61 / 10	[8.51, 8.72]	RANKED	$6.51	18x	batch
Grok 4.5	xAI	8.50 / 10	[8.43, 8.57]	RANKED	$6.98	19x	batch
Claude Sonnet 5	Anthropic	8.62 / 10	[8.56, 8.67]	RANKED	$7.71	21x	batch
Meta Muse Spark 1.1	Meta	8.89 / 10	[8.63, 9.14]	HIGH	$10.74	30x	batch
Gemini 3.5 Flash	Gemini	8.85 / 10	[8.78, 8.92]	RANKED	$10.79	30x	batch
Kimi K2.6	Moonshot AI	8.50 / 10	[8.39, 8.61]	RANKED	$11.95	33x	batch
GPT-5.6 Sol	OpenAI	9.08 / 10	[8.84, 9.31]	HIGH	$12.29	34x	batch
Claude Sonnet 4.6	Anthropic	8.31 / 10	[8.21, 8.40]	RANKED	$18.33	51x	batch
GPT-5.5	OpenAI	8.45 / 10	[8.28, 8.61]	RANKED	$32.64	90x	batch

Overpay shows how much more you pay than the best-value model that clears the quality bar (marked ★), that is, the best-value good-enough option. "16x" means you pay 16× that reference for no quality benefit above the bar.

Typical call shape for this task: 2222 input tokens → 742 output tokens, EMA-tracked from production traffic.

Cost is the observed, all-in $ per 1,000 task runs: each model's own measured usage on this task (output verbosity, thinking/reasoning tokens, cache reads and writes, and the spend on its billed failures), priced at current list rates and adjusted by the billing overhead we actually reconcile against provider invoices. Models that answer tersely cost what they actually cost; models that think at length pay for it. These figures are not comparable to providers' advertised $/1M list rates: this is what running the task costs, not a per-token price.
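
The overpay column is a ratio of each model's cost to the best-value model's cost. A minimal sketch of that ratio follows. The display rule (two significant digits below 10x, whole numbers from 10x up) is an assumption inferred from the table, and `overpay_label` is a hypothetical name. Because the page presumably computes from unrounded costs, a label recomputed from the rounded dollar figures can differ by one unit in some rows.

```python
def overpay_label(cost: float, best_value_cost: float) -> str:
    """Format cost relative to the best-value model, e.g. '2.9x' or '14x'."""
    if cost == best_value_cost:
        return "best value"
    ratio = cost / best_value_cost
    # Assumed display rule: two significant digits under 10x, integers above.
    digits = f"{ratio:.0f}" if ratio >= 10 else f"{ratio:.2g}"
    return f"{digits}x"


BEST_VALUE_COST = 0.36  # DeepSeek V4 Flash, $ per 1,000 runs

print(overpay_label(1.04, BEST_VALUE_COST))  # 2.9x
print(overpay_label(5.00, BEST_VALUE_COST))  # 14x
```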

Prompt templates

This is a pooled capability — 4 prompt families share it. The pair shown first is the most frequently used in production.

IMAGE_PROMPT_GENERATION_SYSTEM + IMAGE_PROMPT_GENERATION_USER (3940 calls in window)

System prompt

You are an expert at generating image prompts for AI image generation systems (DALL-E, Gemini Imagen, etc.).

Your task is to transform report summaries and analysis content into effective image prompts that will generate compelling social sharing images.

**Guidelines for effective image prompts:**

1. **Visual Style**: Specify a clear visual style (professional, modern, corporate, infographic, abstract, photorealistic, etc.)

2. **Subject Focus**: Center the image around the main subject (company logo, stock chart, industry visualization, concept illustration)

3. **Color Palette**: Suggest appropriate colors that match the subject's brand or the report's tone

4. **Composition**: Describe the layout and arrangement of elements (centered, asymmetric, layered, minimalist)

5. **Text Elements**: If text is needed, specify what text should appear and its placement (avoid complex sentences, use keywords/titles only)

6. **Context Elements**: Include relevant contextual elements (charts, graphs, icons, symbols) that support the narrative

7. **Mood & Tone**: Convey the appropriate mood (optimistic, analytical, cautionary, innovative, etc.)

8. **Technical Details**: Specify important technical aspects (high resolution, professional quality, corporate aesthetic, social media optimized)

**What to avoid:**
- Overly complex descriptions
- Multiple conflicting styles
- Too much text (image generators struggle with text)
- Ambiguous or vague requirements
- Generic stock photo descriptions

**Output Format:**
Your response must be a JSON object with a single field `image_prompt` containing the optimized prompt string (200-300 words maximum).

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Generate an image prompt for a social sharing image based on this report:

**Subject**: {subject_name}
**Subject Code**: {subject_code}

**Report Content Summary**:
{report_content}
{reference_context}
**Requirements**:
- Create a professional, eye-catching social sharing image
- The image should be suitable for platforms like Twitter/X, LinkedIn, and Substack
- Focus on visual impact that captures the essence of the report
- Include minimal text if needed (company name, key insight, or title)
- Use a style that matches the subject (corporate for companies, conceptual for analysis, data-driven for financial reports)
- If reference images are provided, maintain visual consistency with the established style while adapting to this specific report's content

**Your Task**:
Generate an optimized image generation prompt that will create an effective social sharing image for this report.

**Output Format**:
The required JSON output schema is provided in the system prompt.
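
How the template pair above might be filled and its response checked can be sketched as follows. The page does not show what `{schema_json_string}` expands to, so `SCHEMA` here is a hypothetical stand-in. `USER_TEMPLATE` is an abbreviated copy of the user prompt, and the function names and the 300-word ceiling check are illustrative.

```python
import json

# Abbreviated stand-in for IMAGE_PROMPT_GENERATION_USER; the real template is longer.
USER_TEMPLATE = (
    "Generate an image prompt for a social sharing image based on this report:\n\n"
    "**Subject**: {subject_name}\n"
    "**Subject Code**: {subject_code}\n\n"
    "**Report Content Summary**:\n"
    "{report_content}\n"
    "{reference_context}\n"
)

# Hypothetical schema; the actual value of {schema_json_string} is not shown.
SCHEMA = {
    "type": "object",
    "properties": {"image_prompt": {"type": "string"}},
    "required": ["image_prompt"],
}
schema_json_string = json.dumps(SCHEMA, indent=2)


def render_user_prompt(subject_name: str, subject_code: str,
                       report_content: str, reference_context: str = "") -> str:
    """Fill the placeholders; braces inside the values are left untouched."""
    return USER_TEMPLATE.format(
        subject_name=subject_name,
        subject_code=subject_code,
        report_content=report_content,
        reference_context=reference_context,
    )


def parse_image_prompt(raw: str, max_words: int = 300) -> str:
    """Extract `image_prompt` from the model's JSON reply and check its length."""
    data = json.loads(raw)
    prompt = data.get("image_prompt") if isinstance(data, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("response lacks a non-empty 'image_prompt' string")
    if len(prompt.split()) > max_words:
        raise ValueError(f"image_prompt exceeds {max_words} words")
    return prompt


rendered = render_user_prompt("Acme Corp", "ACME", "Revenue grew on strong demand.")
print(parse_image_prompt('{"image_prompt": "A clean, modern chart illustration"}'))
```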

TOPIC_IMAGE_PROMPT_GENERATION_SYSTEM + TOPIC_IMAGE_PROMPT_GENERATION_USER (117 calls in window)

System prompt

You are an expert at generating image prompts for AI image generation systems (DALL-E, Gemini Imagen, etc.).

Your task is to create an image prompt that will generate a compelling visual representation of an analysis topic for use on a newsletter/publication website.

**Guidelines for effective topic image prompts:**

1. **Visual Style**: Use a modern, professional style suitable for a newsletter topic card
   - Clean, minimalist design with strong visual impact
   - Abstract or conceptual representations work better than literal depictions

2. **Color & Mood**: Match the tone of the topic
   - Use colors that evoke the topic's essence (e.g., green for sustainability, blue for technology)
   - Create a mood that draws readers to explore the topic

3. **Composition**: Design for a card/grid layout
   - Simple, centered compositions that work at various sizes
   - Avoid text in the image (the topic name will be displayed separately)
   - Consider 16:9 or similar landscape aspect ratio

4. **Abstraction**: Prefer conceptual over literal
   - Use symbols, shapes, and abstract elements
   - Avoid photorealistic faces or specific company logos
   - Create images that represent ideas rather than specific events

5. **Consistency**: Suitable for a series
   - Style should be cohesive enough to look good alongside other topic images
   - Professional quality suitable for a business publication

**What to avoid:**
- Text or words in the image
- Overly complex or busy compositions
- Stock photo aesthetics
- Specific company logos or branded elements
- Photorealistic human faces

**Output Format:**
Your response must be a JSON object with fields:
- `image_prompt`: The optimized prompt string (150-250 words)
- `reasoning`: Brief explanation of why this visual concept fits the topic

## Required Output Format
Your response MUST be a single, valid JSON object conforming to this schema:
```json
{schema_json_string}
```

User prompt

Generate an image prompt for a topic card image based on this topic:

**Topic Name**: {topic_name}

**Topic Description**:
{topic_description}

**Context**: {context}

**Requirements**:
- Create a professional, visually striking image suitable for a newsletter topic card
- The image should represent the essence of this topic
- No text should appear in the image
- Design should work at various sizes (thumbnail to full-width)
- Style should be modern and professional

**Your Task**:
Generate an optimized image generation prompt that will create an effective topic card image.

**Output Format**:
The required JSON output schema is provided in the system prompt.
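
The topic variant asks for two fields and a 150-250 word prompt. A soft check along those lines might look like the sketch below, which collects issues rather than raising. `topic_output_issues` is a hypothetical helper, and counting words by whitespace splitting is an assumption.

```python
import json


def topic_output_issues(raw: str, min_words: int = 150,
                        max_words: int = 250) -> list[str]:
    """Return a list of problems found in a topic image prompt reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    issues = []
    for field in ("image_prompt", "reasoning"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            issues.append(f"missing or empty field: {field}")

    prompt = data.get("image_prompt")
    if isinstance(prompt, str) and prompt.strip():
        n_words = len(prompt.split())  # whitespace-delimited word count
        if not min_words <= n_words <= max_words:
            issues.append(
                f"image_prompt has {n_words} words, expected {min_words}-{max_words}"
            )
    return issues


reply = json.dumps({"image_prompt": "abstract green shapes", "reasoning": ""})
print(topic_output_issues(reply))
```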

JSON_REPAIR_SYSTEM + JSON_REPAIR_USER (54 calls in window)

System prompt

You are a JSON repair tool. The user gives you malformed or partial model output and a JSON Schema. Return ONLY a single valid JSON object that satisfies the schema, salvaging as much real content from the input as possible. Do not invent data for fields the input doesn't support — use the schema's allowed empty/null values. Output the JSON object only: no prose, no markdown, no code fences.

User prompt

JSON Schema:
{schema_json}

Malformed output to repair:
{raw_text}

Return only the corrected JSON object.
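
The page does not say when the repair pair is invoked. One plausible flow, sketched here as an assumption, is to try parsing the original output first and call the repair model only on failure. `parse_or_repair` and the `repair` callable (a wrapper around whatever model client is in use) are hypothetical.

```python
import json
from typing import Callable

# Copy of the JSON_REPAIR_USER template shown above.
JSON_REPAIR_USER = (
    "JSON Schema:\n{schema_json}\n\n"
    "Malformed output to repair:\n{raw_text}\n\n"
    "Return only the corrected JSON object."
)


def parse_or_repair(raw_text: str, schema: dict,
                    repair: Callable[[str], str]) -> dict:
    """Parse model output as a JSON object, falling back to one repair call."""
    try:
        parsed = json.loads(raw_text)
        if isinstance(parsed, dict):
            return parsed  # already valid, no repair call needed
    except json.JSONDecodeError:
        pass

    user_prompt = JSON_REPAIR_USER.format(
        schema_json=json.dumps(schema, indent=2),
        raw_text=raw_text,
    )
    repaired = json.loads(repair(user_prompt))  # raises if repair also fails
    if not isinstance(repaired, dict):
        raise ValueError("repair step did not return a JSON object")
    return repaired


# Stand-in for a real model call, used only to exercise the flow.
def fake_repair(prompt: str) -> str:
    return '{"image_prompt": "A clean chart"}'


print(parse_or_repair('{"image_prompt": "A clean', {}, fake_repair))
```

This sketch checks only that the result is a JSON object. Validating it against the schema would need an additional step.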

JUDGE_QUALITY_SYSTEM + JUDGE_QUALITY_USER (5 calls in window)

System prompt

You are a strict evaluator of LLM outputs. Score how well the output fulfills the task on a 0.0–10.0 scale, using the task-specific rubric as the primary criterion.

The "Rubric" in the user message is authoritative: when it constrains or overrides any generic guidance, the rubric wins.

Scoring scale (0.0–10.0):
- 9.0–10.0: Exceptional — comprehensive, accurate, fully meets the task.
- 7.0–8.9: Good — meets most requirements; minor gaps.
- 5.0–6.9: Satisfactory — adequate but with notable limitations or errors.
- 3.0–4.9: Poor — significant gaps, errors, or partial failure.
- 0.0–2.9: Unacceptable — major failure, unusable output.

Use the provided reference examples (if any) to keep your scoring consistent: compare the current output's quality to those already-scored benchmarks and place it on the same scale. Reference examples may come from different models — judge the output on its own merits, using them only to calibrate the scale.

Output JSON matching the schema:
- score: float from 0.0 to 10.0.
- failure_mode: a short tag for the dominant deficiency (e.g. 'hallucination', 'schema_violation', 'truncated', 'off_topic'), or null when none.
- rationale: one to three sentences justifying the score.

User prompt

Rubric: {rubric}
Task: {task_slug}
Domain: {domain}

Input context:
{input_snippet}

Output to grade:
{output_snippet}

Reference examples (already-scored outputs for the same task — use them to keep scoring consistent):
{reference_examples}

Score the output from 0.0 to 10.0 against the rubric, comparing against the reference examples for consistency. Return JSON with score, failure_mode (or null), and rationale.
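
A judge reply in the shape described above can be checked and mapped onto the scoring bands as sketched below. The band labels come from the system prompt. Assigning scores that fall between printed ranges (for example 8.95) to the lower band is an assumption, and `band` and `parse_verdict` are hypothetical names.

```python
import json

# Lower bounds of the bands listed in JUDGE_QUALITY_SYSTEM, highest first.
BANDS = [
    (9.0, "Exceptional"),
    (7.0, "Good"),
    (5.0, "Satisfactory"),
    (3.0, "Poor"),
    (0.0, "Unacceptable"),
]


def band(score: float) -> str:
    """Map a 0.0-10.0 score to its band label."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON reply and attach the band label."""
    data = json.loads(raw)
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    failure_mode = data.get("failure_mode")
    if failure_mode is not None and not isinstance(failure_mode, str):
        raise ValueError("failure_mode must be a string or null")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    return {
        "score": float(score),
        "band": band(score),
        "failure_mode": failure_mode,
        "rationale": data["rationale"],
    }


verdict = parse_verdict(
    '{"score": 8.5, "failure_mode": null, "rationale": "Meets most requirements."}'
)
print(verdict["band"])  # Good
```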