AI

Why do model leaderboard scores collapse when the test set has never been seen in training?

Opportunity Score: 82

Opportunity

Static benchmarks like MMLU carry contamination rates as high as 45%, and paraphrased or translated versions of test items survive exact-match decontamination while still inflating published scores. A model can top a leaderboard on a contaminated task and fail the same task when it is cleanly rephrased. Dynamic benchmarks that refresh tasks periodically exist but lack standardized design criteria, so results cannot be compared across them or verified as representative of the skill they claim to measure. Every capability and safety claim published on a leaderboard rests on numbers that no independent party can validate as clean.
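To make the paraphrase gap concrete, here is a minimal sketch of two surface-level contamination checks: a normalized exact-match filter and a word n-gram overlap score. The function names, the trigram setting, and the sample quiz text are all hypothetical, made up for illustration; they are not taken from any particular benchmark or decontamination tool.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def exact_match_contaminated(test_item: str, corpus: list[str]) -> bool:
    """Flag the item only if its normalized text appears verbatim in some document."""
    needle = normalize(test_item)
    return any(needle in normalize(doc) for doc in corpus)


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(test_item: str, doc: str, n: int = 3) -> float:
    """Fraction of the test item's word n-grams that also occur in the document."""
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(doc, n)) / len(item_grams)


# Hypothetical data: one verbatim leak and one paraphrased leak of the same item.
test_item = "What is the capital city of France?"
verbatim_doc = "Quiz: What is the capital city of France? Answer: Paris."
paraphrased_doc = "Trivia night: which city serves as France's capital? Paris."

print(exact_match_contaminated(test_item, [verbatim_doc]))     # True
print(exact_match_contaminated(test_item, [paraphrased_doc]))  # False
print(ngram_overlap(test_item, verbatim_doc))                  # 1.0
print(ngram_overlap(test_item, paraphrased_doc))               # 0.0
```

In this toy case both checks catch the verbatim copy and both miss the paraphrase, even though the paraphrased document leaks the same question and answer. Any check that compares surface strings shares this blind spot, which is how a rephrased or translated item can pass decontamination and still inflate a score.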

Why it matters

Trustworthy evaluation is the prerequisite for every downstream safety and deployment decision, and the numbers on which those decisions rest are not currently trustworthy.

How the opportunity is scored

The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.

Severity: 8/10

How much pain it causes when it shows up.

Frequency: 8/10

How often people actually run into it.

White space: 8/10

How little good tooling exists for it today.
