AI Evaluation

The Benchmark Problem: Why LLM Tests Are Losing Their Meaning

Top models achieve over 95% on many classic tests – but what does this really mean? Benchmark inflation makes honest evaluations harder and more important than ever.
Published on August 13, 2025 · by Michael J. Baumann

Large Language Models (LLMs) are becoming increasingly powerful, yet paradoxically, their evaluation is becoming more difficult. We face a fundamental benchmark problem: Many established tests are practically «saturated», while new challenges reveal where even the best models still fail.

When Perfection Deceives: Current Scores of Leading Models

The latest model generations achieve impressive scores, even well beyond saturated classics like MMLU. Here's an overview of the most important top models on three current benchmarks:

| Model           | SWE-bench Verified | AIME '25 (no tools) | GPQA-Diamond (no tools) |
|-----------------|--------------------|---------------------|-------------------------|
| GPT-5           | 74.9%              | 94.6%               | 85.7%                   |
| Claude Opus 4.1 | 74.5%              | -                   | -                       |
| Gemini 2.5 Pro  | -                  | 88.0%               | 86.4%                   |

These high values seem convincing, but they only tell half the story. What happens when a test is essentially «solved»?

The Problem of Benchmark Saturation

Benchmark saturation occurs when most high-performing models achieve similarly high scores. The test then loses its ability to differentiate — like a school test where all students achieve 95%. It says little about who is actually better or where specific strengths lie.

This is exactly what we see today: small differences in prompts or setup often decide rankings more than genuine differences in capability. The question «Which model fits my task?» doesn't become easier; it becomes more complicated.
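To make the loss of differentiation concrete, here is a minimal, illustrative sketch (the model names, scores, and test size are assumptions, not real benchmark data): it computes Wilson confidence intervals for two hypothetical models scoring 95% and 96% on a 500-question test. The heavily overlapping intervals show why a one-point gap near the ceiling says little about which model is actually better.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% confidence by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return center - margin, center + margin

n_questions = 500  # assumed benchmark size
for name, score in [("Model A", 0.95), ("Model B", 0.96)]:  # hypothetical scores
    correct = round(score * n_questions)
    low, high = wilson_interval(correct, n_questions)
    print(f"{name}: {score:.0%} correct, 95% CI [{low:.1%}, {high:.1%}]")

# The two intervals overlap heavily (~[92.7%, 96.6%] vs. ~[93.9%, 97.4%]),
# so a one-point gap near the ceiling is not statistically meaningful here.
```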

Concrete Examples of Outdated Tests

Saturation becomes particularly evident when you look at the trajectory. The AI Index 2025 shows clear progress and convergence on multiple benchmarks, for example SWE-bench jumping from 4.4% (2023) to 71.7% (2024).

| Benchmark    | Status                 | Problem                                            |
|--------------|------------------------|----------------------------------------------------|
| Classic MMLU | Frontier LLMs over 90% | Little differentiation in the top range            |
| GSM8K        | High scores possible   | Best-of-256 sampling reaches 97.7% (not Pass@1)    |
| HumanEval    | Largely solved         | No longer distinctive for modern coding assistance |
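The GSM8K row points to an important distinction: pass@1 (one attempt per problem) versus best-of-n sampling, where a problem counts as solved if any of n samples is correct. Below is a small sketch of the standard unbiased pass@k estimator known from the HumanEval/Codex evaluation; the per-problem sample counts are invented for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    (drawn from n generated samples, c of which are correct) solves the problem.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem where only 5 of 256 sampled solutions are correct.
n_samples, n_correct = 256, 5
for k in (1, 16, 256):
    print(f"pass@{k} = {pass_at_k(n_samples, n_correct, k):.1%}")
# pass@1 stays near 2%, while pass@256 reaches 100% -- which is why a
# best-of-256 score is not comparable to a single-attempt score.
```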

This is why harder successors are emerging, such as MMLU-Pro (ten answer options instead of four, more prompt-stable) and contamination-controlled variants like MMLU-CF. The same development shows why leaderboards are gaining importance: they can adapt, but they also bring potential biases of their own.

When Optimization Becomes Manipulation

The problem is intensified by strategic optimization that borders on manipulation. Documented examples:

Selective Benchmark Execution: OpenAI runs only 477 of the 500 SWE-bench Verified tasks («solutions did not reliably pass on our infrastructure») and excludes the remaining 23. This can artificially inflate the score: if the excluded tasks are conservatively counted as failures, the reported 74.9% extrapolates to about 71.4% on the full set (see the sketch after the table below).

| Benchmark Variant      | Tasks   | Reported Score      | Extrapolated Score (500 Tasks) |
|------------------------|---------|---------------------|--------------------------------|
| OpenAI (selective)     | 477/500 | 74.9%               | ~71.4%                         |
| Other labs (complete)  | 500/500 | Directly comparable | -                              |
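For transparency, here is the arithmetic behind the extrapolated value in the table, under the conservative assumption that every excluded task would have failed. The scores and task counts come from the article; the helper function itself is only illustrative.

```python
def extrapolate_to_full_set(reported_score: float, tasks_run: int, tasks_total: int) -> float:
    """Scale a subset score to the full benchmark, conservatively
    counting every excluded task as a failure."""
    solved = round(reported_score * tasks_run)   # estimated number of tasks solved
    return solved / tasks_total

# SWE-bench Verified: 74.9% reported on 477 of 500 tasks -> ~71.4% on the full set.
print(f"{extrapolate_to_full_set(0.749, 477, 500):.1%}")
```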

Arena Controversies: Reports of «specially crafted, non-public variants» appearing on leaderboards raised questions about comparability. Cohere's «Leaderboard Illusion» study points to systematic problems with selective disclosure. The Arena organizers disputed parts of these allegations but tightened their rules.

Added to this is the ongoing issue of training contamination — one reason why new benchmarks with closed, held-out test sets have emerged. Such practices aren't necessarily malicious, but they show how complicated honest evaluation has become.
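Contamination checks typically rely on simple overlap heuristics. The sketch below is a generic illustration, not any lab's official method: it flags a test item if a sufficiently long word n-gram also appears in a reference corpus (13-grams are a commonly used size). The `corpus_docs` placeholder stands in for whatever training data sample is available.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of the given length."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, corpus_docs: list[str], n: int = 13) -> bool:
    """Flag a test item whose n-grams overlap with any reference document."""
    item_grams = ngrams(test_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

# Hypothetical usage: corpus_docs would be (a sample of) the training data.
corpus_docs = ["..."]  # placeholder
print(looks_contaminated("Which of the following best describes ...", corpus_docs))
```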

HLE: The Attempt at a «Last» Test

Humanity's Last Exam (HLE) was designed to provide one more «difficult, closed» academic test before complete saturation: 2,500 tasks, roughly 10% of them image-based, about 80% short-answer questions scored by exact matching and about 20% multiple-choice, all curated and double-checked.
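Because roughly 80% of HLE is scored by exact matching against short reference answers, grading is mechanically simple but unforgiving. A minimal, illustrative grader with light normalization (a sketch, not HLE's official harness) might look like this:

```python
import re

def normalize(answer: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace before comparing."""
    answer = re.sub(r"[^\w\s]", " ", answer.lower())
    return re.sub(r"\s+", " ", answer).strip()

def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical -- no partial credit."""
    return normalize(prediction) == normalize(reference)

print(exact_match(" The answer is 42. ", "the answer is 42"))  # True
print(exact_match("roughly 42", "42"))                         # False
```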

The results show clear limits of current systems: GPT-5 reaches 25.3%, Gemini 2.5 Pro 21.6% — far below human expert performance. HLE exposes typical weaknesses like fragile multi-step reasoning and lack of robustness across domains.

But even HLE will probably become significantly more solvable within a few model cycles — the pattern repeats.

What Really Matters: Understanding Model Behavior

The most important insight: Scores alone aren't decisive. Much more important is understanding how each model behaves in different situations. Which prompting strategies work? Where are the blind spots? How does the system behave under stress?

Successful AI implementations don't arise from the best benchmark result, but through a systematic approach:

  • Evaluation: Instead of relying on standardized tests, develop domain-specific probes for your use case
  • Prompting: Move beyond one-shot attempts and use clear roles, format checks, and self-verification
  • Workflows: Don't focus solely on single-model performance, but on agentic pipelines with tools
  • Quality Control: Replace aggregate scores with error labels, thresholds, and human-in-the-loop systems

Small, targeted tests often beat large benchmarks when it comes to understanding a model's behavior in your specific use case.
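As a concrete illustration of such a targeted probe, here is a minimal, hypothetical evaluation harness in the spirit of the list above: a few domain-specific test cases, simple pass/fail checks, per-case error labels instead of one aggregate score, and a threshold that decides when a human must review. All prompts, labels, thresholds, and the `my_llm_client` in the usage comment are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    prompt: str
    check: Callable[[str], bool]   # domain-specific pass/fail check
    error_label: str               # what a failure means for the business

# Hypothetical probes for an invoice-extraction use case.
PROBES = [
    Probe("Extract the IBAN from: 'Pay to DE89 3704 0044 0532 0130 00 ...'",
          check=lambda out: "DE89370400440532013000" in out.replace(" ", ""),
          error_label="missed_iban"),
    Probe('Return the net amount as JSON {"net": <number>} for: '
          "'Total 119.00 EUR incl. 19% VAT'",
          check=lambda out: '"net"' in out and "100" in out,
          error_label="vat_math_error"),
]

REVIEW_THRESHOLD = 1.0  # require 100% on these probes, otherwise escalate

def run_probes(model_call: Callable[[str], str]) -> None:
    failures = []
    for probe in PROBES:
        output = model_call(probe.prompt)
        if not probe.check(output):
            failures.append(probe.error_label)   # label the error, don't just count it
    score = 1 - len(failures) / len(PROBES)
    if score < REVIEW_THRESHOLD:
        print(f"Escalate to human review: {failures}")
    else:
        print("All probes passed.")

# Usage (hypothetical client):
# run_probes(lambda prompt: my_llm_client.complete(prompt))
```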

The Way Forward

Benchmarks remain important — but as part of a larger evaluation system. Companies that want to effectively deploy AI need their own evaluation protocols that reflect their specific challenges.

At effektiv, we develop such customized evaluation approaches: realistic, manipulation-resistant, and focused on real business outcomes instead of high scores. Because in the end, what matters isn't how well a model performs on abstract tests — but how reliably it helps you achieve your goals.
