Best LLMs for Trading Recommendation — DTP Benchmark
Models
Frontier on this task: Claude Sonnet 4.6 at 8.79 / 10. The quality bar is 95% of the frontier score: 8.35.
Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
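The qualification rule can be sketched as follows. This is a minimal illustration, not the benchmark's own code: it assumes the bar is 95% of the best point estimate and that a model qualifies when its CI low is at or above the bar. The names `Row`, `quality_bar`, and `qualifiers` are hypothetical; the figures are copied from the cost breakdown table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float   # point estimate, out of 10
    ci_low: float  # lower end of the confidence interval
    cost: float    # blended cost per call, in USD


# Values copied from the cost breakdown table.
ROWS = [
    Row("Kimi K2.6", 8.51, 8.37, 0.044779),
    Row("Claude Sonnet 4.6", 8.79, 8.57, 0.138282),
    Row("Claude Opus 4.7", 8.75, 8.63, 0.230470),
    Row("GPT-5.5", 8.73, 8.58, 0.273628),
]


def quality_bar(rows, fraction=0.95):
    """Bar = fraction of the best point estimate (assumed rule)."""
    return round(max(r.score for r in rows) * fraction, 2)


def qualifiers(rows, fraction=0.95):
    """Models whose CI low clears the bar, cheapest first."""
    bar = quality_bar(rows, fraction)
    return sorted((r for r in rows if r.ci_low >= bar), key=lambda r: r.cost)


bar = quality_bar(ROWS)
cheapest = qualifiers(ROWS)[0]
print(f"bar={bar}, cheapest qualifier={cheapest.model}")
```

Under these assumptions all four listed models clear the bar on CI low, and Kimi K2.6 is the cheapest qualifier.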
Cost breakdown
| Model | Quality | Sample | Blended cost / call | Savings vs best | Mode |
|---|---|---|---|---|---|
| Kimi K2.6 (Moonshot AI) | 8.51 / 10, CI [8.37, 8.64] | n=100 · ranked | $0.044779 | 68% cheaper | batch |
| Claude Sonnet 4.6 (Anthropic, best) | 8.79 / 10, CI [8.57, 9.00] | n=86 · high | $0.138282 | (anchor) | batch |
| Claude Opus 4.7 (Anthropic) | 8.75 / 10, CI [8.63, 8.87] | n=86 · ranked | $0.230470 | - | batch |
| GPT-5.5 (OpenAI) | 8.73 / 10, CI [8.58, 8.88] | n=80 · ranked | $0.273628 | - | batch |
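The "Savings vs best" column appears to be the relative cost reduction against the anchor row (Claude Sonnet 4.6). A minimal sketch, assuming the formula is `1 - cost / anchor_cost`; the function name `savings_vs_best` is hypothetical and the page does not state the formula:

```python
def savings_vs_best(cost: float, anchor_cost: float) -> float:
    """Fraction saved relative to the anchor; negative if more expensive."""
    return 1.0 - cost / anchor_cost


ANCHOR_COST = 0.138282  # Claude Sonnet 4.6, blended cost per call (USD)

kimi = savings_vs_best(0.044779, ANCHOR_COST)
opus = savings_vs_best(0.230470, ANCHOR_COST)

print(f"Kimi K2.6: {kimi:.0%} cheaper")
print(f"Claude Opus 4.7 is cheaper than the anchor: {opus > 0}")
```

With the table's figures this reproduces the 68% shown for Kimi K2.6. Rows that cost more than the anchor come out negative, which would be consistent with those cells being left blank.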
Typical call shape for this task: 5873 input tokens → 17263 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
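The blended-cost formula can be written out directly. In the sketch below, the per-token prices are placeholders invented for illustration (the page does not list any model's prices), and `blended_cost` is a hypothetical helper name. Only the token counts and the 6-decimal rounding come from the text above.

```python
IN_TOKENS = 5873    # typical input tokens per call (from the page)
OUT_TOKENS = 17263  # typical output tokens per call (from the page)


def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """in x in_price + out x out_price, rounded to 6 decimals.

    Prices are assumed to be USD per single token.
    """
    return round(in_tokens * in_price + out_tokens * out_price, 6)


# HYPOTHETICAL prices, not taken from any vendor or from the benchmark:
# $1.00 per million input tokens, $2.00 per million output tokens.
example_in_price = 1.00 / 1_000_000
example_out_price = 2.00 / 1_000_000

cost = blended_cost(IN_TOKENS, OUT_TOKENS, example_in_price, example_out_price)
print(f"${cost:.6f} per call")
```

Because this call shape has roughly three times as many output tokens as input tokens, the output price drives most of the per-call cost unless the input price is much higher.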