
The Benchmark Problem: Why AI Tests Are Becoming Less Meaningful

AI models are becoming so good that classical benchmarks are reaching their limits. The result: manipulation, saturation, and a race for increasingly difficult tests. What does this mean for companies?
Published on July 25, 2025 · by Michael J. Baumann

Large Language Models (LLMs) are becoming increasingly powerful, yet their evaluation is paradoxically becoming more difficult. We're facing a fundamental benchmark problem: Many established tests are practically "exhausted," while new challenges show where even the best models still fail.

The Saturation Problem: When 95% Becomes Meaningless

The core issue: Many classic benchmarks are reaching saturation. Models like GPT-5, Claude 4, and Gemini 2.5 achieve over 90% on tests that were once considered challenging. This creates a paradoxical situation where differences between top models become increasingly difficult to measure.

The saturation becomes particularly evident when you look at the trajectory: the AI Index 2025 reports rapid progress and convergence across several benchmarks, including a jump from 4.4% (2023) to 71.7% (2024) on SWE-bench. Concrete examples of tests that are now largely exhausted:

| Benchmark | Status | Problem |
|---|---|---|
| Classic MMLU | Frontier LLMs over 90% | Hardly any differentiation in the top tier |
| GSM8K | High scores possible | Best-of-256 sampling achieves 97.7% (not Pass@1) |
| HumanEval | Largely solved | No longer discriminative for modern coding assistance |

This is why harder successors such as MMLU-Pro (10 answer options instead of 4, and more prompt-stable) and contamination-controlled variants like MMLU-CF are emerging. Because static test sets keep saturating, live leaderboards are also gaining importance: they can adapt over time, but they introduce their own potential biases.
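
The GSM8K entry in the table above hints at another subtlety: a best-of-n score answers a different question than Pass@1. Here is a minimal sketch of the difference, with `solve` and `is_correct` as purely hypothetical stand-ins for a model call and a grader:

```python
import random

def solve(problem: str) -> str:
    """Hypothetical model call: returns one sampled answer."""
    return random.choice(["42", "41", "40"])  # stand-in for a stochastic LLM

def is_correct(answer: str, reference: str) -> bool:
    return answer.strip() == reference.strip()

def pass_at_1(problem: str, reference: str) -> bool:
    # One attempt, no retries: roughly what a user experiences on the first try.
    return is_correct(solve(problem), reference)

def best_of_n(problem: str, reference: str, n: int = 256) -> bool:
    # n attempts, counted as solved if any single one is correct.
    # With n = 256 this is far more forgiving than a single attempt.
    return any(is_correct(solve(problem), reference) for _ in range(n))
```

A 97.7% best-of-256 number therefore says little about single-shot reliability, which is what most production use cases depend on.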

When Optimization Becomes Manipulation

The problem is exacerbated by strategic optimization that borders on manipulation. Verified examples:

Selective Benchmark Execution: OpenAI evaluates only 477 of the 500 tasks on SWE-bench Verified and omits 23 because their "solutions did not reliably pass on our infrastructure". This can artificially inflate scores: the stated 74.9% shrinks to roughly 71.4% when extrapolated to the full set.

| Benchmark Variant | Tasks | Stated Score | Extrapolated Score (500 tasks) |
|---|---|---|---|
| OpenAI (selective) | 477/500 | 74.9% | ~71.4% |
| Other Labs (complete) | 500/500 | Directly comparable | - |
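
The ~71.4% figure is simply what you get if the 23 omitted tasks are counted as unsolved; a quick sanity check of the arithmetic (the solved count of 357 is derived from the published percentage, not an officially reported number):

```python
# Stated result: 74.9% of the 477 evaluated tasks solved.
solved = round(0.749 * 477)     # ~357 tasks
# Conservative extrapolation: count the 23 omitted tasks as failures.
extrapolated = solved / 500
print(f"{extrapolated:.1%}")    # -> 71.4%
```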

Arena Controversies: Reports about "specially crafted, non-public variants" on leaderboards raised questions about comparability. Cohere's "Leaderboard Illusion" study shows systematic problems with selective disclosure. Arena organizers disputed parts of these allegations but tightened their rules.

Added to this is the ongoing issue of training contamination, one reason why new benchmarks with closed test sets have emerged. Such practices aren't necessarily malicious, but they show how complex honest evaluation has become.
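
Contamination can at least be probed crudely: if long word sequences from test items also appear verbatim in the training data, the scores are suspect. Below is a minimal sketch of such an n-gram overlap check, assuming you actually have access to the training text (which, for most commercial models, you don't):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word n-grams; long n-gram overlap is a commonly used contamination heuristic."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, training_text: str, n: int = 13) -> bool:
    # Flag the test item if any long n-gram from it also occurs in the training text.
    return bool(ngrams(test_item, n) & ngrams(training_text, n))
```

Closed test sets sidestep the problem by keeping the reference material out of public circulation in the first place.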

HLE: The Attempt at a "Last" Test

Humanity's Last Exam (HLE) was developed as one more "hard, closed" academic test before complete saturation: 2,500 tasks, roughly 10% of them image-based, about 80% short-answer questions graded by exact matching and about 20% multiple-choice, curated and cross-checked.
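
"Exact matching" is stricter than it sounds: formatting noise alone can cost points. A simplified illustration of this style of grading follows; the normalization rules are assumptions for the sketch, not HLE's published pipeline:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and strip most punctuation and extra whitespace (assumed rules)."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s./-]", "", answer)
    return re.sub(r"\s+", " ", answer)

def exact_match(prediction: str, reference: str) -> bool:
    # Binary credit: the prediction must equal the reference after normalization.
    return normalize(prediction) == normalize(reference)
```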

The results show clear limits of current systems: GPT-5 reaches 25.3%, Gemini 2.5 Pro 21.6% — far below human expert performance. HLE exposes typical weaknesses like fragile multi-step reasoning and lack of robustness across domains.

But even HLE will probably become significantly more solvable within a few model cycles — the pattern repeats.

What Really Matters: Understanding Model Behavior

The most important insight: Scores alone aren't decisive. Much more important is understanding how each model behaves in different situations. Which prompting strategies work? Where are the blind spots? How does the system behave under stress?

Successful AI implementations don't arise from the best benchmark result, but through a systematic approach:

  • Evaluation: Instead of relying on standardized tests, develop domain-specific probes for your use case (see the sketch after this list)
  • Prompting: Move beyond one-shot attempts and use clear roles, format checks, and self-verification
  • Workflows: Don't just rely on single-model performance, but on agentic pipelines with tools
  • Quality Control: Replace aggregate scores with error labels, thresholds, and human-in-the-loop systems
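
As a concrete starting point, here is a minimal sketch of such a probe harness: hypothetical domain prompts with machine-checkable expectations, error labels instead of a single aggregate score, and a threshold that triggers human review. The probes, the `call_model` stub, and the threshold value are illustrative placeholders, not a finished framework:

```python
import re

# Hypothetical domain probes: each pairs a prompt with a machine-checkable expectation.
PROBES = [
    {"prompt": "Extract the invoice total from: 'Total due: EUR 1,250.00'",
     "expect": re.compile(r"1[.,]?250[.,]00")},
    {"prompt": "Answer strictly with YES or NO: Is VAT included in a net price?",
     "expect": re.compile(r"^(YES|NO)$")},
]

def call_model(prompt: str) -> str:
    """Placeholder for your model client (API or local). Replace with a real call."""
    return "YES"  # canned answer so the sketch runs end-to-end

def run_probes() -> list[dict]:
    results = []
    for probe in PROBES:
        output = call_model(probe["prompt"]).strip()
        ok = bool(probe["expect"].search(output))
        # Label errors instead of collapsing everything into one aggregate score.
        results.append({"prompt": probe["prompt"], "output": output,
                        "label": "pass" if ok else "format_or_content_error"})
    return results

def needs_human_review(results: list[dict], threshold: float = 0.9) -> bool:
    # Human-in-the-loop trigger: escalate if the pass rate drops below the threshold.
    passed = sum(r["label"] == "pass" for r in results)
    return passed / len(results) < threshold

if __name__ == "__main__":
    results = run_probes()
    for r in results:
        print(r["label"], "|", r["prompt"][:50])
    print("Escalate to human review:", needs_human_review(results))
```

The point is not the specific checks but the structure: explicit expectations, labeled failures, and a defined escalation path.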

Small, targeted tests often beat large benchmarks when it comes to understanding how a model behaves in your specific use case.

The Way Forward

Benchmarks remain important — but as part of a larger evaluation system. Companies that want to use AI effectively need their own evaluation protocols that reflect their specific challenges.

At effektiv, we develop such tailored evaluation approaches: realistic, manipulation-resistant, and focused on real business outcomes rather than high scores. Because in the end, it's not about how well a model performs on abstract tests — but how reliably it helps you achieve your goals.
