Why do model leaderboard scores collapse when the test set has never been seen in training?

Opportunity

Static benchmarks like MMLU carry contamination rates as high as 45%, and paraphrased or translated versions of test items survive exact-match decontamination while still inflating published scores. A model can top a leaderboard on a contaminated task and fail the same task when it is cleanly rephrased. Dynamic benchmarks that refresh tasks periodically exist but lack standardized design criteria, so results cannot be compared across them or verified as representative of the skill they claim to measure. Every capability and safety claim published on a leaderboard rests on numbers that no independent party can validate as clean.
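To make the decontamination gap concrete, here is a minimal sketch of why a rephrased test item slips past an exact-match filter. Everything in it is an assumption for illustration: the function names, the example question, the one-document "training corpus", and the 0.5 similarity threshold are hypothetical and are not taken from any real benchmark or decontamination tool.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_match_contaminated(test_item: str, corpus: list[str]) -> bool:
    """Flag the item only if its normalized text appears verbatim in a document."""
    needle = normalize(test_item)
    return any(needle in normalize(doc) for doc in corpus)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of normalized word tokens."""
    tokens_a = set(normalize(a).split())
    tokens_b = set(normalize(b).split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fuzzy_contaminated(test_item: str, corpus: list[str],
                       threshold: float = 0.5) -> bool:
    """Flag the item if any document shares enough vocabulary with it.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return any(token_jaccard(test_item, doc) >= threshold for doc in corpus)


# Hypothetical test item and a paraphrase of it sitting in training data.
test_item = "What is the capital city of Australia?"
corpus = ["Which city is the capital of Australia?"]

print(exact_match_contaminated(test_item, corpus))  # exact-match verdict
print(fuzzy_contaminated(test_item, corpus))        # token-overlap verdict
```

In this toy case the exact-match check reports the item as clean while the token-overlap check flags it. Token overlap is only a slightly stronger heuristic: a translated copy shares almost no surface tokens with the original and would pass both checks, which is the harder half of the problem described above.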

Why it matters

Trustworthy evaluation is the prerequisite for every downstream safety and deployment decision, and the numbers on which those decisions rest are not currently trustworthy.

How I score the opportunity

The Opportunity Score is my own read, not a measurement: how much it hurts, how often it bites, and how little exists to solve it today. Higher means I think it is more worth building.

Severity: 8/10

How much pain it causes when it shows up.

Frequency: 8/10

How often people actually run into it.

Whitespace: 8/10

How little good tooling exists for it today.
