About — The Right-Sized LLM Benchmark

What this is

An independent, transparent LLM benchmark with one job: for each step of an LLM-backed pipeline, show the cheapest model that’s actually good enough — not the best model by default.

It runs on a premise most leaderboards ignore: production tasks have a quality threshold, not a maximum. Once a model clears the bar, extra capability is headroom you pay for and never use. So instead of ranking models head-to-head in aggregate, we score them per capability and rank the qualifiers by cost.
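
The "clear the bar, then sort by cost" idea can be sketched in a few lines. This is only an illustration, not the site's actual code: the `ModelResult` type, the model names, the 0 to 1 quality scale, and the per-task costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    model: str            # hypothetical model name
    quality: float        # assumed 0-1 quality score for one capability
    cost_per_task: float  # assumed cost in USD per task


def qualifiers_by_cost(results, threshold):
    """Keep models whose quality meets the threshold, cheapest first."""
    passing = [r for r in results if r.quality >= threshold]
    # Ties on cost are broken in favour of the higher-quality model.
    return sorted(passing, key=lambda r: (r.cost_per_task, -r.quality))


# Made-up numbers, purely for illustration.
results = [
    ModelResult("model-a", 0.93, 0.040),
    ModelResult("model-b", 0.86, 0.006),
    ModelResult("model-c", 0.71, 0.001),
]
ranked = qualifiers_by_cost(results, threshold=0.80)
print([r.model for r in ranked])
```

With a threshold of 0.80, `model-c` is the cheapest but misses the bar, so `model-b` ranks first even though `model-a` scores higher.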

Every score here is derived from real production workloads — the financial-analysis pipeline run by KAPUALabs LLC through DTP. Not synthetic test sets, not customer data, not self-reported model numbers.

How it works (the short version)

Each task’s input goes to every candidate model in parallel.
A panel of LLM judges grades every output; scores are normalized per-judge, so a strict judge doesn’t sink the models it rates.
A score is published only once enough judges agree for it to be trusted (confidence-gated). Anything we’re unsure about stays hidden.
Among models that clear your quality bar, rows sort by cost ascending.
The default “good enough” threshold is recomputed weekly from where the savings actually are.
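
A rough sketch of the judging steps above, under stated assumptions. The page does not say how per-judge normalization or the confidence gate is computed, so this uses z-scores per judge and a simple "enough judges, small enough spread" rule. The function names, `min_judges`, `max_spread`, and all scores are hypothetical.

```python
from statistics import mean, pstdev


def normalize_per_judge(raw):
    """raw: {judge: {model: score}} -> {judge: {model: z-score}} (assumed method)."""
    normalized = {}
    for judge, scores in raw.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        # A judge that scores every model identically carries no ranking signal.
        normalized[judge] = {
            m: 0.0 if sigma == 0 else (s - mu) / sigma
            for m, s in scores.items()
        }
    return normalized


def gated_scores(normalized, min_judges=3, max_spread=0.5):
    """Return a model's mean z-score only if enough judges agree, else None."""
    models = {m for scores in normalized.values() for m in scores}
    published = {}
    for m in sorted(models):
        zs = [scores[m] for scores in normalized.values() if m in scores]
        agree = len(zs) >= min_judges and pstdev(zs) <= max_spread
        published[m] = mean(zs) if agree else None  # None means "stay hidden"
    return published


# Made-up raw scores: "strict" grades 3 points lower than "lenient" across the board.
raw = {
    "lenient": {"model-a": 9.0, "model-b": 8.0, "model-c": 7.0},
    "strict": {"model-a": 6.0, "model-b": 5.0, "model-c": 4.0},
    "split": {"model-a": 8.0, "model-b": 4.0, "model-c": 6.0},
}
normalized = normalize_per_judge(raw)
published = gated_scores(normalized)
print(published)
```

After normalization the lenient and strict judges produce the same z-scores, so the strict judge's lower raw numbers do not penalize any model. In this toy data the judges agree on `model-a`, so it gets a score, while they disagree on `model-b` and `model-c`, which stay hidden.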

Full mechanics — quality score, confidence bands, the slider, the cost math — live on the methodology page.

Reproducibility

The system and user prompt templates for every capability are published on its task page (the “Prompt templates” section).
Evaluation rubrics are listed on each task page.
Each capability comes with one curated example input/output pair.
Snapshots are immutable; the JSON payload that drives the site is committed to git on every publish.

Prompt + rubric + model + typical input shape is everything you need to reproduce a given score.
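
If you re-run a score yourself, one way to keep those ingredients together is a small record with a stable fingerprint. This is a hypothetical sketch, not a format the site publishes: the `ReproRecord` fields, the example values, and the SHA-256 fingerprint are all assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReproRecord:
    # Hypothetical field names; the published pages may organise these differently.
    capability: str
    model: str
    system_prompt: str
    user_prompt_template: str
    rubric: str
    input_shape: str

    def fingerprint(self) -> str:
        """SHA-256 over a key-sorted JSON dump, so equal records hash equally."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values, for illustration only.
record = ReproRecord(
    capability="example-capability",
    model="model-b",
    system_prompt="You are a careful analyst.",
    user_prompt_template="Summarize: {document}",
    rubric="Accuracy, completeness, concision.",
    input_shape="one plain-text document",
)
print(record.fingerprint())
```

Two runs that share a fingerprint used the same prompt, rubric, model, and input shape; any change to one of them changes the hash.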

Cadence

Weekly snapshots, Monday mornings — plus ad-hoc snapshots when a new best-in-class model lands. The public site shows the most recent five in the changelog; full history lives in our backend. A writeup of each snapshot goes out in the journal — subscribe by email.

Who runs it

KAPUALabs LLC, alongside nova.kapualabs.com, where the source workloads originate. Questions, corrections, disagreements: bench@kapualabs.com.

Legal

Independent benchmarking. Results are derived from KAPUALabs LLC’s own production workloads, not from customer data. Each model links to its provider’s terms of service.