What this site does

Most production LLM tasks have a quality threshold, not a quality maximum. Once a model clears the bar, additional capability is unused — you’re paying for headroom your pipeline doesn’t exercise. This site shows you which models clear your quality bar on each step of your pipeline, and ranks them by cost ascending so you can pick the right-sized model rather than reaching for the best-performing model by default.

Quality Score (the metric)

Every model on every task gets a quality score on a 0–10 scale (the internal judge rubric). On the absolute rail, the score comes from LLM-judge verdicts, aggregated via the Phase 1.6 CI gates and normalized by a per-judge calibration offset so that a stricter judge doesn’t sink the models it rated. Scores are displayed as “X.X / 10” on individual task pages.

Confidence (how much to trust the score)

Each cell carries a confidence interval — ci_low, point estimate, ci_high — and a discrete confidence level: LOW, MEDIUM, HIGH, or RANKED, derived from CI width and sample count.

Only MEDIUM-or-better cells appear in the qualifying set, regardless of the slider position. LOW cells are listed but greyed and excluded from category averages and the composite. This is the same gate the internal selector uses to admit models to production traffic — the public ranking is the subset of cells the system itself trusts.
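As a rough illustration of the gate described above, the sketch below models a cell and drops LOW cells before averaging. The field names, the `Cell` class, and the ordering LOW < MEDIUM < HIGH < RANKED are assumptions taken from the order the levels are listed in; the rule that derives a level from CI width and sample count is not reproduced here because its cut-offs are not published on this page.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Assumed ordering, weakest to strongest, following the order used in the text.
LEVELS = ("LOW", "MEDIUM", "HIGH", "RANKED")


@dataclass(frozen=True)
class Cell:
    """One model-on-task result (hypothetical shape)."""
    model: str
    task: str
    ci_low: float
    score: float       # point estimate on the 0-10 scale
    ci_high: float
    confidence: str    # one of LEVELS


def is_trusted(cell: Cell) -> bool:
    """MEDIUM-or-better gate: LOW cells never qualify."""
    return LEVELS.index(cell.confidence) >= LEVELS.index("MEDIUM")


def category_average(cells: list[Cell]) -> Optional[float]:
    """Average point estimates over trusted cells only; None if none qualify."""
    trusted = [c.score for c in cells if is_trusted(c)]
    return mean(trusted) if trusted else None


cells = [
    Cell("model-a", "summarize", 7.9, 8.2, 8.5, "HIGH"),
    Cell("model-b", "summarize", 5.0, 7.0, 9.0, "LOW"),     # listed, but excluded
    Cell("model-c", "summarize", 7.0, 7.4, 7.8, "MEDIUM"),
]
print(category_average(cells))  # averages model-a and model-c only
```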

The Quality Bar (the slider)

The slider asks: how good does a model need to be to handle this task well? At 90%, any model that reaches 90% of the best-performing model’s quality on the task is good enough — the leaderboard surfaces those qualifiers sorted by cost ascending. The best-performing model itself stays on every chart as the reference point.

The bar is relative to the best model rather than absolute because task difficulty varies widely. On a hard task where the best model scores 6.5, a 90% bar puts “good enough” at 5.85; on a saturated easy task, “good enough” might be 9.5. Absolute thresholds would leave hard tasks empty and flood easy tasks. A relative bar means the slider position reads as “right-sized for this task” everywhere.
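The relative bar can be sketched in a few lines. This is a minimal illustration, not the site’s implementation: the function name, the model names, and the cost figures are hypothetical, and the inputs are assumed to contain only cells that already passed the confidence gate.

```python
def qualifying_models(
    scores: dict[str, float],
    costs: dict[str, float],
    bar_pct: float,
) -> tuple[float, list[str]]:
    """Return (threshold, qualifiers sorted by cost ascending) for one task.

    scores: model -> quality score (0-10) on this task
    costs:  model -> blended cost per call on this task
    bar_pct: slider position, e.g. 90 for "90% of the best model's quality"
    """
    best = max(scores.values())
    # Multiply before dividing so that 6.5 at a 90% bar gives 5.85.
    threshold = best * bar_pct / 100.0
    qualifiers = [m for m, s in scores.items() if s >= threshold]
    return threshold, sorted(qualifiers, key=lambda m: costs[m])


# Hard task from the text: the best model scores 6.5, the slider is at 90%.
scores = {"best": 6.5, "mid": 6.0, "small": 5.85, "tiny": 5.2}
costs = {"best": 0.030, "mid": 0.008, "small": 0.002, "tiny": 0.001}

threshold, ranked = qualifying_models(scores, costs, 90)
print(threshold, ranked)
```

Here “tiny” is the cheapest model but falls below the bar, so the cheapest qualifier is “small”; the best-performing model always qualifies because it defines the bar.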

Cost and Savings (the rank key)

Within the qualifying set, rows sort ascending by blended cost per call — computed from the typical input + output token shape we see for each task in production traffic, multiplied by per-model pricing.
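One plausible reading of “blended cost per call” is sketched below. The page only says that a typical token shape is multiplied by per-model pricing, so the separate input and output rates, the per-million-token price unit, and all names here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenShape:
    """Typical tokens per call for a task (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    """Assumed pricing unit: US dollars per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def blended_cost_per_call(shape: TokenShape, pricing: Pricing) -> float:
    """Cost in dollars of one call with the task's typical token shape."""
    total = (
        shape.input_tokens * pricing.input_per_mtok
        + shape.output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


# Illustrative numbers only: 2,000 input and 500 output tokens per call.
shape = TokenShape(input_tokens=2000, output_tokens=500)
price = Pricing(input_per_mtok=1.00, output_per_mtok=4.00)
print(blended_cost_per_call(shape, price))
```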

Every model row also shows its savings vs the best-performing model: how much cheaper the model is than the best on this task, in percent and absolute terms. The same quality you actually need, N% cheaper.

For the composite “single-model picker” on the landing page: this model runs your whole pipeline at the chosen bar for $X, vs $Y if you used the best-performing model on everything — $Z saved.
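The savings figures reduce to simple arithmetic. The sketch below assumes per-step blended costs are already known; the function names and the sample pipeline are hypothetical.

```python
def savings_vs_best(model_cost: float, best_cost: float) -> tuple[float, float]:
    """Return (absolute savings, percent savings) against the best model."""
    absolute = best_cost - model_cost
    percent = 100.0 * absolute / best_cost if best_cost > 0 else 0.0
    return absolute, percent


def pipeline_savings(
    chosen_costs: dict[str, float],
    best_costs: dict[str, float],
) -> tuple[float, float, float]:
    """Return ($X, $Y, $Z) for the composite picker.

    chosen_costs: step -> cost of the single chosen model on that step
    best_costs:   step -> cost of the best-performing model on that step
    """
    x = sum(chosen_costs.values())
    y = sum(best_costs.values())
    return x, y, y - x


# Illustrative two-step pipeline.
chosen = {"extract": 0.002, "summarize": 0.004}
best = {"extract": 0.010, "summarize": 0.030}

print(savings_vs_best(0.002, 0.010))   # per-row savings on the "extract" step
print(pipeline_savings(chosen, best))  # whole-pipeline $X, $Y, $Z
```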

Default slider position — chosen by the data

The default slider position isn’t picked by us; it’s computed each week via knee detection on the savings curve. For each candidate threshold (75, 80, 85, 90, 95, 100), we compute total pipeline savings at that threshold; the curve’s elbow, the point past which loosening further yields diminishing returns, becomes the default. The result is clamped to [80, 95] to avoid extremes.

When a new best-performing model lands and the curve reshapes, the default reshapes with it. No human-tuned defaults; just the threshold where the biggest savings live this week.
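The page does not say which knee-detection method is used, so the sketch below swaps in a common heuristic as an assumption: normalize both axes, draw a chord between the first and last points of the curve, and pick the point farthest from that chord. The function name, the sample savings values, and the flat-curve fallback are hypothetical; only the candidate thresholds and the [80, 95] clamp come from the text.

```python
def knee_threshold(
    thresholds: list[float],
    savings: list[float],
    lo: float = 80,
    hi: float = 95,
) -> float:
    """Pick the elbow of the savings curve, clamped to [lo, hi].

    thresholds must be sorted ascending, with one savings value per threshold.
    """
    s_min, s_max = min(savings), max(savings)
    if s_max == s_min:
        # Flat curve: loosening buys nothing, so fall back to the strictest
        # allowed default (an assumption, not stated in the source).
        return hi

    # Normalize both axes to [0, 1] so neither unit dominates the distance.
    x0, x1 = thresholds[0], thresholds[-1]
    xs = [(t - x0) / (x1 - x0) for t in thresholds]
    ys = [(s - s_min) / (s_max - s_min) for s in savings]

    # Perpendicular distance of every point from the chord joining the ends.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = (dx * dx + dy * dy) ** 0.5
    dists = [
        abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
        for x, y in zip(xs, ys)
    ]

    knee = thresholds[dists.index(max(dists))]
    return max(lo, min(hi, knee))


# Illustrative curve: fraction of pipeline cost saved at each threshold.
candidates = [75, 80, 85, 90, 95, 100]
saved = [0.62, 0.60, 0.57, 0.52, 0.30, 0.0]
print(knee_threshold(candidates, saved))
```

In this made-up curve most of the savings arrive by the time the bar drops to 90, and loosening further adds little, so the heuristic lands on 90.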

Excluded data

  • Cells flagged excluded_from_stats (account suspension, quota exhaustion, infrastructure failures).
  • LOW-confidence cells (never qualify, never set anchors).
  • Deprecated models.
  • Meta-evaluation task types (judges scoring judges).
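The four exclusion rules above can be read as a single predicate over a cell. In this sketch only `excluded_from_stats` is a name taken from the page; the other field names, the `CellMeta` class, and the meta-evaluation task-type label are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for "judges scoring judges" task types.
META_EVAL_TASK_TYPES = {"judge_evaluation"}


@dataclass(frozen=True)
class CellMeta:
    model: str
    task_type: str
    confidence: str                    # LOW, MEDIUM, HIGH, or RANKED
    excluded_from_stats: bool = False  # suspension, quota, infra failures
    model_deprecated: bool = False


def exclusion_reason(cell: CellMeta) -> Optional[str]:
    """Return why a cell is excluded, or None if it counts toward rankings."""
    if cell.excluded_from_stats:
        return "excluded_from_stats"
    if cell.confidence == "LOW":
        return "low_confidence"
    if cell.model_deprecated:
        return "deprecated_model"
    if cell.task_type in META_EVAL_TASK_TYPES:
        return "meta_evaluation"
    return None


sample = [
    CellMeta("model-a", "summarize", "HIGH"),
    CellMeta("model-b", "summarize", "LOW"),
    CellMeta("model-c", "summarize", "HIGH", excluded_from_stats=True),
    CellMeta("model-d", "judge_evaluation", "HIGH"),
    CellMeta("model-e", "summarize", "MEDIUM", model_deprecated=True),
]
kept = [c.model for c in sample if exclusion_reason(c) is None]
print(kept)
```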

Judge panel

The methodology page lists the current judge panel — the LLMs whose verdicts produce quality scores — with role, calibration offset (how strict this judge is vs the panel mean), and active-since date. Only the current panel is shown; retired judges live in the snapshot record but aren’t published.

Each task’s input is sent to every active candidate model in parallel during fan-out. Every candidate’s output is shown to every panel judge. The judges’ aggregated verdicts — normalized by per-judge calibration offset — produce the quality score.
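A simplified view of the last step, turning one candidate’s judge verdicts into a score, is sketched below. Several things here are assumptions: the offset is taken to be the judge’s average score minus the panel mean (so a strict judge has a negative offset), the aggregation is a plain mean, and the result is clamped to the 0–10 scale. The real pipeline also applies the CI gates mentioned earlier, which are not modeled here, and the judge names are placeholders.

```python
from statistics import mean


def quality_score(verdicts: dict[str, float], offsets: dict[str, float]) -> float:
    """Combine per-judge verdicts for one candidate output into one score.

    verdicts: judge -> raw score on the 0-10 rubric
    offsets:  judge -> calibration offset (assumed: judge mean - panel mean)
    """
    # Subtracting the offset lifts scores from strict judges and lowers
    # scores from lenient ones, so no judge's habits dominate the result.
    adjusted = [raw - offsets.get(judge, 0.0) for judge, raw in verdicts.items()]
    return min(10.0, max(0.0, mean(adjusted)))


# Illustrative panel of three judges scoring the same output.
verdicts = {"judge-a": 8.0, "judge-b": 7.0, "judge-c": 7.5}
offsets = {"judge-a": 0.5, "judge-b": -0.5, "judge-c": 0.0}
print(quality_score(verdicts, offsets))
```

With these numbers the lenient judge’s 8.0 and the strict judge’s 7.0 both normalize to 7.5, so the panel agrees once calibration is applied.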

Current judge panel

Judge                     Provider                    Role
5 mini                    OpenAI                      primary
Gemini 3 Flash Preview    Gemini                      primary
Claude Sonnet 4.6         Anthropic                   primary
GPT-5.4                   OpenAI                      primary
DeepSeek V4 Flash         DeepSeek                    primary
DeepSeek V4 Pro           DeepSeek                    primary
Kimi K2.6                 Moonshot AI                 primary
GPT-5.5                   OpenAI                      primary
GPT-5.4 mini              OpenAI                      primary
GPT-5.4 nano              OpenAI                      primary
Qwen 3.6 Plus             Alibaba Cloud (DashScope)   primary
Qwen 3.5 Flash            Alibaba Cloud (DashScope)   primary

Panel version v58 · Effective from 2026-05-11T02:34:09.883849+00:00
TTAL transition: 56 · Claude Opus 4.7 (Anthropic) · Effective: 2026-05-09 · active: True -> False