DTP LLM Benchmark launches

We’re launching the DTP LLM Benchmark — independent rankings of leading LLMs on real production tasks from our financial-analyst pipeline at Nova.

The premise is simple: most LLM tasks have a quality threshold, not a quality maximum. The best-performing model is overkill for most steps. We show you which models actually clear your quality bar on each task and rank them by ascending cost. The cheapest qualifier wins, not the highest-scoring one.
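As a rough illustration of that rule, here is a minimal Python sketch: filter to the models that meet the threshold, sort by cost, take the first. The `ModelResult` type, the 0-1 quality scale, the per-run USD cost, and the tie-break on quality are all assumptions made for the example, and the numbers are invented; none of this is the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelResult:
    name: str       # hypothetical model label
    quality: float  # task score on an assumed 0-1 scale
    cost: float     # assumed cost per task run, in USD


def cheapest_qualifier(results: list[ModelResult], threshold: float) -> Optional[ModelResult]:
    """Return the lowest-cost model whose quality meets the threshold, or None."""
    qualifiers = [r for r in results if r.quality >= threshold]
    # Rank by ascending cost; break cost ties in favor of higher quality (an assumption).
    qualifiers.sort(key=lambda r: (r.cost, -r.quality))
    return qualifiers[0] if qualifiers else None


# Made-up numbers for illustration only, not benchmark results.
results = [
    ModelResult("model-a", quality=0.95, cost=0.040),
    ModelResult("model-b", quality=0.88, cost=0.006),
    ModelResult("model-c", quality=0.72, cost=0.001),
]

winner = cheapest_qualifier(results, threshold=0.85)
print(winner.name if winner else "no model clears the bar")
```

With a bar of 0.85, the highest scorer (`model-a`) qualifies but loses to the cheaper `model-b`; `model-c` is cheapest overall but never enters the ranking because it misses the bar.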

Today’s snapshot covers 51 task types across 8 capability categories, with the current pool of leading models ranked on each task. Set the slider on the landing page to your quality tolerance; the leaderboard reorders to show you the right-sized model and the savings versus running the best-performing model on everything.
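One way to picture the savings figure is the sketch below, which compares the cost of right-sizing every task against the cost of running a single top model on all of them. It rests on several assumptions that are not taken from the methodology: "best-performing" is defined here as the highest mean quality across tasks, each task counts as one run, a task with no qualifying model falls back to the top model, and the `scores` layout and numbers are invented.

```python
# task -> model -> (quality on an assumed 0-1 scale, assumed USD cost per run)
Scores = dict[str, dict[str, tuple[float, float]]]


def best_overall(scores: Scores) -> str:
    """Model with the highest mean quality across all tasks (an assumed definition)."""
    models = next(iter(scores.values())).keys()
    return max(models, key=lambda m: sum(scores[t][m][0] for t in scores) / len(scores))


def savings_at(scores: Scores, tolerance: float) -> float:
    """Fraction of spend saved by right-sizing each task instead of using one top model."""
    top = best_overall(scores)
    baseline = sum(scores[t][top][1] for t in scores)
    right_sized = 0.0
    for per_model in scores.values():
        costs = [cost for quality, cost in per_model.values() if quality >= tolerance]
        # If nothing clears the bar, fall back to the top model's cost for that task.
        right_sized += min(costs) if costs else per_model[top][1]
    return 1.0 - right_sized / baseline


# Made-up numbers for illustration only, not benchmark results.
scores: Scores = {
    "extract": {"model-a": (0.96, 0.040), "model-b": (0.90, 0.006)},
    "summarize": {"model-a": (0.93, 0.040), "model-b": (0.78, 0.006)},
}

for tolerance in (0.75, 0.85, 0.99):
    print(f"tolerance {tolerance:.2f}: savings {savings_at(scores, tolerance):.1%}")
```

Moving the tolerance plays the role of the slider: a looser bar lets the cheaper model take more tasks and the savings grow, while a bar that only the top model clears drives the savings to zero.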

More — methodology deep dives, weekly snapshot writeups, judge mechanics — will land in the journal as we ship.

Read the methodology →