About — DTP LLM Benchmark
Independent benchmarking from real financial-analyst workloads.
What this is
An independent, transparent LLM benchmark scoped to the work of building LLM-backed business processes. Every score on this site is derived from production workloads run by Capua Labs through DTP — not from synthetic test sets, not from customer data, not from self-reported model numbers.
Methodology
Full methodology lives at /methodology/: what the quality score is, how confidence is gated, what fan-out and judging mean, and why the slider default moves week to week.
Reproducibility
- All template prompts are published at /prompts/.
- All evaluation rubrics are listed on each task page.
- One curated example input/output pair is published per task type.
- Snapshots are immutable; the JSON payload that drives the site is committed to git on every publish.
If you want to reproduce a specific score on a specific model, the template prompt, the rubric, the model, and the typical input shape together give you what you need.
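As a rough illustration of how those four ingredients fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `ReproSpec` fields, the `{placeholder}` template syntax, the `call_model` and `judge` callables, and the pass-fraction aggregate, which is a stand-in and not the site's quality score (that is defined at /methodology/).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReproSpec:
    """The four ingredients for one reproduction run (hypothetical field names)."""
    template_prompt: str           # text copied from /prompts/
    rubric: list[str]              # criteria copied from the task page
    model: str                     # identifier of the model under test
    example_input: dict[str, str]  # values matching the typical input shape


def render_prompt(spec: ReproSpec) -> str:
    # Assumption: templates use Python-style {placeholders}.
    # The published templates may use a different syntax.
    return spec.template_prompt.format(**spec.example_input)


def reproduce(
    spec: ReproSpec,
    call_model: Callable[[str, str], str],  # (model, prompt) -> output text
    judge: Callable[[str, str], bool],      # (criterion, output) -> pass/fail
) -> float:
    """Run the model once and return the fraction of rubric criteria passed.

    The pass fraction is a placeholder aggregate, not the benchmark's score.
    """
    output = call_model(spec.model, render_prompt(spec))
    passed = sum(1 for criterion in spec.rubric if judge(criterion, output))
    return passed / len(spec.rubric)
```

To use it, you would supply your own `call_model` (a wrapper around whichever provider client you use) and your own `judge` (a human check or an LLM grader applied per criterion).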
Cadence
Weekly snapshots, Monday mornings. Ad-hoc snapshots when a new best-in-class model lands. The full snapshot history lives in our backend; the public site shows the five most recent at /changelog/.
Contact
This benchmark is run by Capua Labs alongside nova.kapualabs.com, where the source workloads originate. Email: bench@kapualabs.com.
Legal
Independent benchmarking. Results derived from Capua Labs’ own production workloads, not from customer data. Models are linked to their providers' terms of service.