Methodology

What this site does

Most production LLM tasks have a quality threshold, not a quality maximum. Once a model clears the bar, additional capability is unused — you’re paying for headroom your pipeline doesn’t exercise. This site shows you which models clear your quality bar on each step of your pipeline, and ranks them by cost ascending so you can pick the right-sized model rather than reaching for the best-performing model by default.

Quality Score (the metric)

Every model on every task gets a quality score on a 0–10 scale (the internal judge rubric). On the absolute rail, the score comes from LLM-judge verdicts that are aggregated through confidence-interval (CI) gates and normalized by a per-judge offset, so a stricter judge doesn’t sink the models it rated. Scores are displayed as “X.X / 10” on individual task pages.

Confidence (how much to trust the score)

A quality score of 8.4 isn’t worth much if the true value could just as easily be 7.0 or 9.8: the same displayed number can mean very different things depending on how uncertain it is. Confidence is our answer to “how sure are we?”. It is derived from how many independent LLM judges have graded the model’s outputs on this task and how closely they agree.

A polling analogy

Think of each judge as a survey respondent. The model’s true quality is what an infinite panel of judges would converge on; what we see is a sample. More respondents + tighter agreement = narrower margin of error. On a 0–10 scale:

RANKED: like a major-network political poll the day before an election, with thousands of responses and a result that holds to within ±2 points on a 100-point scale. We’re as sure as we can practically get. (ci_half ≤ 0.20, i.e. ±0.20 on the 0–10 scale.)
HIGH: like a restaurant rated by several hundred guests with fairly consistent reviews. Within ±3 points on a 100-point scale. Order of finish against similar models is stable. (ci_half ≤ 0.30, i.e. ±0.30 on the 0–10 scale.)
MEDIUM: like a small book with thirty reviews that mostly agree. Within ±5 points on a 100-point scale. Tight enough to publish, and tight enough that the qualification decision (does it clear the bar?) usually doesn’t depend on the uncertainty, but treat fine-grained rankings between two MEDIUM models with a grain of salt. (ci_half ≤ 0.50, i.e. ±0.50 on the 0–10 scale.)
LOW: like a niche restaurant with three reviews ranging from one star to five. Not enough signal yet, so these cells are hidden everywhere on the site and can’t mislead. (ci_half > 0.50.)

How it’s computed

Each cell carries a 95% confidence interval (ci_low, point estimate, ci_high). The discrete level above is set by the half-width of that interval: ci_half = 1.96 · σ / √n, where σ is judge-score dispersion and n is the sample count. Lower σ and higher n both tighten the band — that’s why more judges + more agreement = higher confidence.
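As a minimal sketch of the mapping above (not the site's actual implementation), the half-width formula and the four level cut-offs can be written in Python as follows. The function names are hypothetical, and using the sample standard deviation for σ is an assumption; the text only says σ is judge-score dispersion.

```python
import math
from statistics import stdev

Z_95 = 1.96  # z-value for a 95% interval, as in the formula above


def ci_half_width(scores):
    """Return 1.96 * sigma / sqrt(n) for a list of judge scores (0-10 scale)."""
    n = len(scores)
    if n < 2:
        # Assumption: with fewer than two scores there is no dispersion
        # estimate, so treat the interval as unbounded (maps to LOW).
        return math.inf
    sigma = stdev(scores)  # assumption: sample standard deviation
    return Z_95 * sigma / math.sqrt(n)


def confidence_level(ci_half):
    """Map a CI half-width to the discrete level using the published cut-offs."""
    if ci_half <= 0.20:
        return "RANKED"
    if ci_half <= 0.30:
        return "HIGH"
    if ci_half <= 0.50:
        return "MEDIUM"
    return "LOW"


scores = [8.2, 8.5, 8.4, 8.6, 8.3, 8.4, 8.5, 8.3]  # made-up judge scores
half = ci_half_width(scores)
print(f"ci_half = {half:.2f} -> {confidence_level(half)}")
```

Eight made-up scores that agree this closely give a half-width under 0.20 and land in RANKED; a wider spread or fewer scores pushes the same cell toward MEDIUM or LOW.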

Only MEDIUM-or-better cells appear in the qualifying set, regardless of the slider position. This is the same gate the internal selector uses to admit models to production traffic — the public ranking is the subset of cells the system itself trusts.

The Quality Bar (the slider)

The slider asks: how good does a model need to be to handle this task well? At 90%, any model that reaches 90% of the best-performing model’s quality on the task is good enough — the leaderboard surfaces those qualifiers sorted by cost ascending. The best-performing model itself stays on every chart as the reference point.

The bar is relative to the best model rather than absolute because task difficulty varies wildly. On a hard task where the best model scores 6.5, a 90% bar puts “good enough” at 5.85; on a saturated easy task where the best model scores close to 10, the same bar sits close to 9.0. Absolute thresholds would leave hard tasks empty and flood easy tasks. The slider position means “right-sized for this task” everywhere.
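The selection rule described in this section and the confidence gate before it can be sketched as a small filter. This is an illustration under stated assumptions, not the production selector: the `Cell` record, its field names, and the `qualifiers` function are hypothetical, and the best-model anchor is taken from trusted (MEDIUM-or-better) cells only.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    model: str
    quality: float      # quality score on the 0-10 scale
    confidence: str     # "RANKED", "HIGH", "MEDIUM" or "LOW"
    cost_per_1k: float  # observed cost per 1,000 runs


TRUSTED = {"RANKED", "HIGH", "MEDIUM"}  # LOW cells never qualify


def qualifiers(cells, bar_pct):
    """Return cells within bar_pct% of the best quality, cheapest first."""
    trusted = [c for c in cells if c.confidence in TRUSTED]
    if not trusted:
        return []
    best = max(c.quality for c in trusted)
    threshold = best * bar_pct / 100
    passing = [c for c in trusted if c.quality >= threshold]
    return sorted(passing, key=lambda c: c.cost_per_1k)


# Made-up cells for one hard task where the best trusted score is 6.5.
cells = [
    Cell("model-a", 6.5, "HIGH", 40.0),
    Cell("model-b", 6.0, "MEDIUM", 5.0),
    Cell("model-c", 5.0, "RANKED", 1.0),
    Cell("model-d", 9.0, "LOW", 0.5),  # hidden: too uncertain to count
]
for cell in qualifiers(cells, bar_pct=90):
    print(cell.model, cell.quality, cell.cost_per_1k)
```

With the bar at 90%, the threshold is 5.85, so model-b and model-a qualify and model-b is listed first because it is cheaper.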

Cost (the rank key)

Within the qualifying set, rows sort ascending by observed, all-in cost per task run (displayed per 1,000 runs — single runs are sub-cent). This is not a per-token list price: it is what running the task on that model actually costs on our production workload. Each model’s cost comes from its own measured usage on the task — averaged over its recent production calls — including output verbosity, thinking/reasoning tokens, cache reads and writes, server-tool searches, and the spend on its billed failures (a model that wastes money on rejected outputs pays for its own waste). That usage shape is priced at current list rates, band-selected by the real call size, and scaled by the actual-billed-vs-list overhead we reconcile against provider invoices.

Two consequences worth being explicit about. First, the figure is not comparable to providers’ advertised $/1M rates — a model with cheap tokens that thinks at length, or answers verbosely, can cost more per run than a pricier-per-token model that answers tersely. Measuring per run is the point. Second, cache-hit rates depend on our prompt architecture and call cadence, so the number is “cost on a workload like ours” — we publish each task’s typical call shape alongside for context. Where a model has too little recent traffic on a task to measure, we fall back to a modelled estimate at the task’s typical call shape and flag it (cost_basis in the published data: observed, observed_repriced for a stale-but-real shape repriced at current rates, or modelled).
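To make the first consequence concrete, here is a deliberately simplified sketch of per-run pricing. It assumes only two token categories and invented rates and usage figures; the function name is hypothetical, and band selection, cache traffic, tool searches, and billed failures are all left out.

```python
def cost_per_1k_runs(usage_per_run, rate_per_million, overhead=1.0):
    """Price an average per-run usage shape and scale it to 1,000 runs.

    usage_per_run:    average tokens per run, keyed by category
    rate_per_million: list price in dollars per 1M tokens, same keys
    overhead:         billed-vs-list multiplier from invoice reconciliation
    """
    per_run = sum(
        tokens * rate_per_million[category] / 1_000_000
        for category, tokens in usage_per_run.items()
    )
    return per_run * overhead * 1000


# Hypothetical models: cheap tokens with long reasoning vs pricier but terse.
cheap_verbose = cost_per_1k_runs(
    {"input": 2000, "output": 6000}, {"input": 0.50, "output": 2.00}
)
pricey_terse = cost_per_1k_runs(
    {"input": 2000, "output": 300}, {"input": 3.00, "output": 15.00}
)
print(f"cheap-but-verbose: ${cheap_verbose:.2f} per 1,000 runs")
print(f"pricey-but-terse:  ${pricey_terse:.2f} per 1,000 runs")
```

In this invented example the model with the lower per-token price ends up costing more per 1,000 runs, because its output is twenty times longer.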

Every model row also shows its cost relative to the cheapest qualifier — the cheapest model that clears the quality bar on that task. That cheapest good-enough model is the reference (marked ★); every other model is shown as a multiple of it: 16x means it costs 16× the cheapest qualifier for no quality benefit above the bar. The cheapest model good enough for the job is the baseline; everything else is what you’d overpay.

We anchor to the cheapest qualifier — not the best-performing model — for two reasons. It answers the question the site is built around (what does the cheapest good-enough model cost, and how much more does everything else?), and it spreads the multiples across a readable range instead of bunching them near one extreme. The reference moves with the slider: raise the bar and fewer — usually pricier — models qualify, so the cheapest-qualifier baseline shifts with it. The best-performing model still appears on every chart (marked best) as the quality reference; it just no longer sets the cost baseline.
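The multiples can be sketched in a few lines. The function name and the cost figures are hypothetical, and the sketch assumes every qualifier has a positive cost.

```python
def cost_multiples(qualifier_costs):
    """Map each qualifying model to its multiple of the cheapest qualifier."""
    if not qualifier_costs:
        return {}
    baseline = min(qualifier_costs.values())  # the starred reference row
    return {model: cost / baseline for model, cost in qualifier_costs.items()}


# Made-up costs per 1,000 runs at two slider positions.
at_90 = cost_multiples({"small": 2.0, "mid": 8.0, "large": 32.0})
at_95 = cost_multiples({"mid": 8.0, "large": 32.0})  # "small" no longer qualifies
print(at_90)
print(at_95)
```

Raising the bar removes the cheapest model from the qualifying set here, so the baseline moves from "small" to "mid" and the multiple shown for "large" drops from 16 to 4 even though its own cost has not changed.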

Batch vs sync pricing

The cost mode toggle switches the price each model is shown at. “Sync only” prices every model at standard synchronous rates. “Batch if supported” uses batch pricing wherever the provider offers it; a model without batch support shows its sync figure in batch mode, because running it inside a batch-when-possible pipeline simply costs its sync price. That way every model stays in scope and comparable. Both figures are reconciliation-adjusted: the observed usage shape is priced at that tier’s list rates, then scaled by the actual-billed-vs-list overhead we measure for each provider and model. The batch-vs-sync gap you see is therefore the discount we actually observe when reconciling provider bills for that model. It is a per-provider, per-model figure that can sit above or below a provider’s nominal headline batch rate, not a flat list-price discount.
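The fallback rule for models without batch support can be sketched as below. The mode strings and the function name are hypothetical stand-ins for the two toggle positions, and both cost arguments are assumed to be already reconciliation-adjusted.

```python
from typing import Optional


def displayed_cost(sync_cost: float, batch_cost: Optional[float], mode: str) -> float:
    """Pick the figure to show for one model under the chosen cost mode.

    batch_cost is None when the provider offers no batch tier for the model.
    """
    if mode == "sync_only":
        return sync_cost
    if mode == "batch_if_supported":
        # No batch tier: the model still runs, but at its sync price.
        return batch_cost if batch_cost is not None else sync_cost
    raise ValueError(f"unknown cost mode: {mode!r}")


print(displayed_cost(10.0, 5.5, "batch_if_supported"))   # batch tier exists
print(displayed_cost(10.0, None, "batch_if_supported"))  # falls back to sync
```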

Default slider position — chosen by the data

The default slider position isn’t picked by us; it’s computed each week via knee detection on the savings curve. For each candidate threshold (75, 80, 85, 90, 95, 100), we compute total pipeline savings at that threshold. The curve’s elbow, the point where loosening the bar further yields diminishing returns, becomes the default. The result is clamped to [80, 95] to avoid extremes.

When a new best-performing model lands and the curve reshapes, the default reshapes with it. No human-tuned defaults; just the threshold where the biggest savings live this week.
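The text does not say which knee-detection method is used. As one possible sketch, the elbow can be taken as the point that sits farthest from the straight chord joining the two ends of the normalised curve, a common heuristic that is swapped in here purely for illustration. The function name and the savings figures are hypothetical.

```python
def default_threshold(savings, lo=80, hi=95):
    """Pick the elbow of a savings curve and clamp it to [lo, hi].

    savings: dict mapping candidate threshold (percent) to total pipeline
             savings at that threshold (any consistent unit).
    """
    xs = sorted(savings)
    ys = [savings[x] for x in xs]
    x_span = xs[-1] - xs[0]
    y_span = max(ys) - min(ys)
    if x_span == 0 or y_span == 0:
        # A flat or single-point curve has no elbow; fall back to the middle.
        knee = xs[len(xs) // 2]
    else:
        # Normalise both axes to [0, 1] so the units do not matter.
        us = [(x - xs[0]) / x_span for x in xs]
        vs = [(y - min(ys)) / y_span for y in ys]
        # Vertical gap between each point and the chord joining the first
        # and last points; the largest gap marks the elbow.
        gaps = [abs(v - (vs[0] + (vs[-1] - vs[0]) * u)) for u, v in zip(us, vs)]
        knee = xs[gaps.index(max(gaps))]
    return max(lo, min(hi, knee))


# Made-up curve: savings barely grow below 90 but collapse above it.
weekly_savings = {75: 100, 80: 98, 85: 95, 90: 90, 95: 60, 100: 0}
print(default_threshold(weekly_savings))
```

On this made-up curve the elbow falls at 90: loosening the bar below 90 adds little extra savings, while tightening it above 90 gives most of the savings away.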

Excluded data

The following are excluded from the published scores and rankings:
Cells flagged excluded_from_stats (account suspension, quota exhaustion, infrastructure failures).
LOW-confidence cells (never qualify, never set anchors).
Deprecated models.
Meta-evaluation capabilities (judges scoring judges).

Judge panel

This page lists the current judge panel, the LLMs whose verdicts produce quality scores, with each judge’s role, calibration offset (how strict the judge is relative to the panel mean), and active-since date. Only the current panel is shown; retired judges remain in the snapshot record but aren’t published.

Each task’s input is sent to every active candidate model in parallel during fan-out. Every candidate’s output is shown to every panel judge. The judges’ aggregated verdicts — normalized by per-judge calibration offset — produce the quality score.
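The offset normalisation can be sketched as follows. How the offsets are actually derived is not stated, so computing each one as the judge's mean score minus the panel mean is an assumption, as are the function names and all numbers; the CI gating step is omitted.

```python
from statistics import mean


def calibration_offsets(history):
    """Estimate how far each judge sits from the panel mean.

    history: dict of judge name -> raw scores that judge has given.
    A negative offset means the judge is stricter than the panel.
    """
    judge_means = {judge: mean(scores) for judge, scores in history.items()}
    panel_mean = mean(judge_means.values())
    return {judge: m - panel_mean for judge, m in judge_means.items()}


def quality_score(verdicts, offsets):
    """Average one model's verdicts after removing each judge's offset.

    verdicts: dict of judge name -> raw 0-10 score for this model and task.
    """
    adjusted = [score - offsets.get(judge, 0.0) for judge, score in verdicts.items()]
    return min(10.0, max(0.0, mean(adjusted)))  # keep the result on the 0-10 scale


# Made-up history: one strict judge, one lenient judge.
offsets = calibration_offsets({"strict": [6.0, 7.0], "lenient": [8.0, 9.0]})
print(offsets)
print(quality_score({"strict": 7.0, "lenient": 9.0}, offsets))
```

After adjustment both made-up judges agree on the same score, which is the intent of the offset: a model is not penalised for which judge happened to grade it.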

Current judge panel

Judge	Provider	Role
Gemini 3.1 Pro Preview	Gemini	primary
Gemini 3.5 Flash	Gemini	primary
DeepSeek V4 Pro	DeepSeek	primary
Claude Opus 4.7	Anthropic	primary
Claude Sonnet 4.6	Anthropic	primary
GPT-5.5	OpenAI	primary

Panel version v59 · Effective from 2026-05-22T16:43:38.738011+00:00 · reinit triggered for language-detection by cli:cli at 2026-05-11T11:39:43.227658+00:00 | promote_panel: v59 has equal-or-better coverage than v58 across all qualifying TTs; 11 TTs newly gain coverage (operator=klemencas-via-claude)