What this is

An independent, transparent LLM benchmark scoped to the work of building LLM-backed business processes. Every score on this site is derived from production workloads run by Capua Labs through DTP — not from synthetic test sets, not from customer data, not from self-reported model numbers.

Methodology

Full methodology lives at /methodology/: what the quality score is, how confidence is gated, what fan-out and judging mean, and why the slider default moves week to week.

Reproducibility

  • All template prompts are published at /prompts/.
  • All evaluation rubrics are listed on each task page.
  • One curated example input/output pair is published per task type.
  • Snapshots are immutable; the JSON payload that drives the site is committed to git on every publish.

To reproduce a specific score for a specific model, you need four things: the template prompt, the rubric, the model, and the typical input shape.
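As a rough illustration, a reproduction run could be wired together as in the Python sketch below. Everything named here is hypothetical: the `{input}` placeholder, the `ReproductionSpec` type, and the `call_model` and `judge` callables are assumptions, not part of this site or of any provider's API. The scoring rule (fraction of rubric criteria met, averaged over inputs) is a stand-in only; the actual quality score is defined at /methodology/.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class ReproductionSpec:
    """The pieces needed to re-run one task against one model (hypothetical type)."""
    template_prompt: str   # published template; "{input}" placeholder is an assumption
    rubric: Sequence[str]  # rubric criteria copied from the task page
    model: str             # identifier of the model under test


def reproduce_score(
    spec: ReproductionSpec,
    inputs: Sequence[str],
    call_model: Callable[[str, str], str],  # (model, prompt) -> output; supplied by you
    judge: Callable[[str, str], bool],      # (criterion, output) -> met?; supplied by you
) -> float:
    """Return the mean fraction of rubric criteria met across all inputs.

    This is a placeholder scoring rule, not the site's quality score.
    """
    if not spec.rubric:
        raise ValueError("rubric must contain at least one criterion")
    if not inputs:
        raise ValueError("at least one input is required")

    per_input = []
    for text in inputs:
        # Plain replace avoids str.format tripping over other braces in the template.
        prompt = spec.template_prompt.replace("{input}", text)
        output = call_model(spec.model, prompt)
        met = sum(1 for criterion in spec.rubric if judge(criterion, output))
        per_input.append(met / len(spec.rubric))
    return mean(per_input)
```

To use it, pass your own `call_model` wrapper around whichever provider client you use and a `judge` that applies one rubric criterion to one output. Inputs should match the typical input shape described for the task.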

Cadence

Snapshots are taken weekly, on Monday mornings, with ad-hoc snapshots when a new best-in-class model lands. The full snapshot history lives in our backend; the public site shows the five most recent at /changelog/.

Contact

This is run by Capua Labs alongside nova.kapualabs.com, where the source workloads originate. Email: bench@kapualabs.com.

Independent benchmarking. Results derived from Capua Labs’ own production workloads, not from customer data. Models are linked to their providers’ terms of service.