
Benchmark Drift and the Problem with Static Evals

An empirical look at how model rankings shift across MMLU, HumanEval, and GPQA as capability frontiers move — and what this means for choosing models in production.

Benchmarks are time capsules. They measure capability relative to the difficulty that existed when they were written, not the difficulty that matters now.

This creates a predictable failure mode: a benchmark starts as a meaningful signal, models improve on it, and it gradually becomes a measure of fine-tuning budget rather than general capability. By the time the field notices, it has been leaning on a degraded signal for months.

The signal decay problem

The pattern shows up clearly in MMLU. In 2021, a score of 60% was impressive. By late 2023, frontier models were routinely above 85%. The benchmark did not become easier — the models became capable enough that it stopped being the right test.

The same arc is now visible in HumanEval. A benchmark designed to measure code synthesis ability has become so heavily optimized against that it no longer cleanly separates models in the way that matters for real tasks.

Frontier model performance on saturating benchmarks, 2021–2024

Year    MMLU    HumanEval
2021    60%     28%
2022    72%     54%
2023    86%     81%
2024    91%     92%

Best-published score per year across models. Saturation is visible when the spread between models collapses near the ceiling.

What the table above shows is not just that models improve. It is that the spread between models on saturated benchmarks collapses even as their real-world capability differences remain significant. The benchmark stops being a useful discriminator.
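To make "spread collapse" concrete, here is a minimal Python sketch. The model names and scores are hypothetical placeholders, not published results; the point is only that the same summary statistics look very different on a saturated benchmark versus one with headroom.

```python
# Sketch: measuring whether a benchmark still discriminates between models.
# All scores below are hypothetical placeholders, not published results.

from statistics import pstdev

def discrimination(scores: dict[str, float]) -> dict[str, float]:
    """Summarize how much a benchmark separates a set of models.

    scores maps model name -> accuracy in [0, 1].
    """
    values = list(scores.values())
    return {
        "best": max(values),
        "spread": max(values) - min(values),  # gap between best and worst
        "stdev": pstdev(values),              # dispersion across all models
    }

# Hypothetical scores on a saturated benchmark: every model sits near the
# ceiling, so the spread is tiny and rankings are mostly noise.
saturated = {"model_a": 0.91, "model_b": 0.90, "model_c": 0.89}

# Hypothetical scores on a benchmark with headroom: the same models separate clearly.
unsaturated = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.37}

print(discrimination(saturated))    # spread ~0.02 -> weak signal
print(discrimination(unsaturated))  # spread ~0.25 -> still informative
```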

Implications for production decisions

If you are choosing a model for a production system based primarily on published benchmark numbers, you are almost certainly making the decision on stale information.

The benchmarks that are currently moving — that have not yet been saturated — are the ones worth tracking. GPQA is still meaningful. LiveCodeBench is still meaningful, in part because it is continuously updated. Needle-in-a-haystack evaluations at 100k+ context remain hard enough to be informative.

The more useful practice is to run evals on tasks that actually resemble your use case. Not as a replacement for public benchmarks, but as a supplement. The model that scores highest on MMLU is not necessarily the model that writes the best SQL for your schema.
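A use-case eval does not need to be elaborate. The sketch below shows one way it could look: a handful of task cases, each with a task-specific pass/fail check, run against whatever model client you already have. The `Case` structure, the `call_model` callable, and the SQL check are illustrative assumptions, not a reference to any particular eval framework.

```python
# Minimal sketch of a use-case-specific eval harness.
# `call_model` and the example case are placeholders for your own model
# client and task data; no specific vendor API is assumed.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str                    # task input, e.g. a question about your schema
    check: Callable[[str], bool]   # task-specific pass/fail check on the output

def run_eval(call_model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(1 for case in cases if case.check(call_model(case.prompt)))
    return passed / len(cases)

# Example: a crude check that generated SQL at least targets the right table.
cases = [
    Case(
        prompt="Write SQL to count orders per customer in the last 30 days.",
        check=lambda out: "orders" in out.lower() and "group by" in out.lower(),
    ),
]

# score = run_eval(my_model_client, cases)  # my_model_client is your own wrapper
```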

A note on interpretation

None of this means benchmarks are useless. They are useful for tracking progress within a model family, for catching regressions, and as rough filters when you have no other information.

The error is treating them as precise measurements. They are signals with decay rates. The closer a benchmark is to saturation, the less it tells you about what you actually care about.

Building in periodic recalibration — asking whether the evals you’re running still have spread, still track the things that matter — is the practice that prevents benchmark drift from quietly corrupting your decision-making.
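One lightweight way to build that recalibration in is a periodic check over your own eval results that flags anything near the ceiling or with collapsed spread. The thresholds below are arbitrary illustrative defaults, not recommendations.

```python
# Sketch of a periodic recalibration check: flag evals in your suite whose
# scores have drifted toward the ceiling or whose spread has collapsed.
# Thresholds are illustrative assumptions, not tuned values.

def needs_recalibration(
    scores: dict[str, float],
    ceiling: float = 0.90,      # treat anything above this as near-saturated
    min_spread: float = 0.05,   # below this, the eval barely ranks models
) -> bool:
    values = list(scores.values())
    near_ceiling = min(values) >= ceiling
    collapsed = (max(values) - min(values)) < min_spread
    return near_ceiling or collapsed

# Run this over each eval's latest results for the models you actually compare;
# anything it flags is a candidate for replacement or a harder variant.
```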