Category: Relevance, Classification & Matching · Rail: absolute · Typical I/O: 87232→940 tokens

Models

Frontier on this task: Qwen 3.5 Flash at 7.30 / 10. Quality bar at 95%: 6.94.
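The quality bar appears to be 95% of the frontier score. A minimal sketch of that arithmetic follows; the function name `quality_bar` is hypothetical, and half-up rounding to two decimals is an assumption, since the page does not state its rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def quality_bar(frontier_score: str, fraction: str = "0.95") -> Decimal:
    """Return fraction * frontier_score, rounded half-up to two decimals.

    Assumption: the page rounds half-up; Decimal avoids float rounding noise.
    """
    raw = Decimal(frontier_score) * Decimal(fraction)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Frontier score taken from the page: 7.30 / 10.
print(quality_bar("7.30"))  # 6.94
```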

Chart (quality axis 0 to 10, with the 95% bar marked). Blended cost per call and savings per model:

Qwen 3.5 Flash · $0.002861/call · 0% cheaper
Qwen 3.6 Plus · $0.030183/call · -955% cheaper
Haiku 4.5 · $0.045966/call · -1507% cheaper
Claude Opus 4.7 · $0.229830/call · -7933% cheaper
Claude Sonnet 4.6 · $0.137898/call · -4720% cheaper
DeepSeek V4 Flash · $0.012476/call · -336% cheaper
DeepSeek V4 Pro · $0.155055/call · -5320% cheaper
Gemini 3.1 Flash Lite · $0.011609/call · -306% cheaper
Gemini 3.1 Pro Preview · $0.092872/call · -3146% cheaper
MiniMax M2.5 · $0.027298/call · -854% cheaper
Kimi K2.6 · $0.051978/call · -1717% cheaper
GPT-5.4 mini · $0.034827/call · -1117% cheaper
GPT-5.4 nano · $0.009311/call · -225% cheaper
GPT-5.5 · $0.232180/call · -8015% cheaper

Legend: point-estimate floor (CI low) · upper CI (less certain). Bars are sorted by blended cost, cheapest qualifier first. Greyed rows are MEDIUM+ models whose point estimate clears the bar but whose CI low does not.
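The "% cheaper" figures in the chart are consistent with savings measured against the cheapest qualifier, where a negative value means the model costs more. The sketch below reproduces that arithmetic and the greyed-row rule from the legend. Both function names are hypothetical, the rule that a model qualifies when its CI low clears the bar is inferred from the legend, and the MEDIUM+ condition is not modelled.

```python
def percent_cheaper(cost: float, anchor_cost: float) -> int:
    """Savings relative to the anchor in whole percent; negative means more expensive."""
    return round((1 - cost / anchor_cost) * 100)


def bar_status(point: float, ci_low: float, bar: float) -> str:
    """Classify a model against the quality bar (inferred rule, see lead-in)."""
    if ci_low >= bar:
        return "qualifies"
    if point >= bar:
        return "greyed"  # point estimate clears the bar, CI low does not
    return "below bar"


ANCHOR = 0.002861  # Qwen 3.5 Flash blended cost per call, from the page

print(percent_cheaper(0.030183, ANCHOR))  # Qwen 3.6 Plus: -955
print(percent_cheaper(0.232180, ANCHOR))  # GPT-5.5: -8015
print(bar_status(7.30, 7.00, 6.94))       # Qwen 3.5 Flash: qualifies
```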

Cost breakdown

Model · Quality · Sample · Blended cost / call · Savings vs best · Mode
Qwen 3.5 Flash (best), Alibaba Cloud (DashScope) · 7.30 / 10, CI [7.00, 7.60] · n=100, high · $0.002861 · (anchor) · sync

Typical call shape for this task: 87232 input tokens → 940 output tokens, EMA-tracked from production traffic. Blended cost = (in × in_price + out × out_price), rounded to 6 decimals.
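As an illustration of the blended-cost formula, the sketch below applies it to the typical call shape. The per-token prices are placeholders chosen for the example, not the actual rates of any listed model, and the function name `blended_cost` is hypothetical.

```python
def blended_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Blended cost per call: in * in_price + out * out_price, rounded to 6 decimals.

    Prices are in dollars per token.
    """
    return round(in_tokens * in_price + out_tokens * out_price, 6)


# Typical call shape from the page: 87232 input tokens, 940 output tokens.
# HYPOTHETICAL prices: $0.10 per million input tokens, $0.40 per million output tokens.
cost = blended_cost(87232, 940, in_price=1e-7, out_price=4e-7)
print(f"${cost:.6f}/call")  # $0.009099/call
```

With a call shape this input-heavy, the input price dominates: in the example, input tokens account for roughly 96% of the cost.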

Evaluation rubric

The output under evaluation is a batch of {"id", "relevance_score"}
entries scoring how well X.com (Twitter) posts match a given scoring
context (a topic name, a topic description, and a topic report summary).

Judge ONLY the correctness of the relevance assignments that were
delivered. For each entry present in the output, the question is: is
this relevance_score a defensible reflection of how well that post's
content matches the scoring context?

Apply these correctness checks:
- Clearly on-topic posts should carry high scores; clearly off-topic
  posts should carry low scores. Wrong-direction assignments are the
  primary quality failure.
- Scores should discriminate between posts that visibly differ in
  topical fit. A run of identical scores across plainly-different posts
  is degenerate output, not calibrated judgement.
- Score magnitudes should be internally consistent within the batch —
  posts the model rated similarly should be similarly relevant on
  inspection.

Do NOT consider response completeness. Whether the output covers every
input post is OUT OF SCOPE for this evaluation. The production pipeline
reschedules any unscored items to a different model, so partial output
is a routing outcome — not a quality defect. Grade only the entries
that ARE present, each on its own merit; treat missing entries as if
they never belonged to this batch.

Also ignore cosmetic details that don't change the assignment's
correctness: decimal precision, JSON field ordering, whitespace,
phrasing of any optional reasoning field.

Penalise:
- Assignments that are demonstrably wrong (high score on a clearly
  off-topic post, low score on a clearly on-topic post).
- Degenerate batches: all-zero, all-one, or all-identical scores when
  the input posts visibly differ in topical fit.
- Outputs where the score-to-content relationship is random or absent.

Reward:
- Correct directionality and meaningful discrimination across the batch,
  even when the batch is small.
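
The degenerate-batch penalty above can be partly pre-screened mechanically before a judge sees the output. The sketch below is one possible check, not part of the rubric: it flags a batch whose delivered scores are all identical. The sample ids and the 0 to 1 score range are assumptions, and the check cannot tell whether the input posts actually differ in topical fit, so a flagged batch still needs inspection.

```python
import json


def is_degenerate(entries: list[dict]) -> bool:
    """True when a batch of two or more entries carries one identical score."""
    scores = [entry["relevance_score"] for entry in entries]
    return len(scores) > 1 and len(set(scores)) == 1


# HYPOTHETICAL batches in the {"id", "relevance_score"} shape the rubric describes.
flat = json.loads('[{"id": "a", "relevance_score": 0.0},'
                  ' {"id": "b", "relevance_score": 0.0}]')
varied = json.loads('[{"id": "a", "relevance_score": 0.9},'
                    ' {"id": "b", "relevance_score": 0.1}]')

print(is_degenerate(flat))    # True
print(is_degenerate(varied))  # False
```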