Can we fix the AI evaluation crisis?

As a tech journalist, I am often asked questions such as “Is DeepSeek better than ChatGPT?” and “Is Anthropic’s model any good?” If I don’t want to turn the conversation into a long seminar, I usually give a diplomatic answer: “They’re both solid, in different ways.” It is human to try to make sense of new and powerful things. But this simple question, “Is this model good?”, is really the everyday version of a much more complex technical problem.

The way we have tried to answer this question so far is with benchmarks: fixed sets of questions that models answer, graded by how many they get right. But much like standardized exams such as the SAT (a college admissions test used by many US schools), these scores are not always indicative of deeper abilities.

It seems as if a new AI model is released every week, and every release comes with updated scores showing improvement over previous models. On paper, everything keeps getting better. In reality, it’s not that simple. As Russell Brandom explained in his article, just as a person who drills for the SAT can raise their score without improving their critical thinking, models can be trained to optimize benchmark results without actually getting smarter. Andrej Karpathy, an OpenAI and Tesla AI veteran, recently said that we are experiencing an evaluation crisis: our scoreboard for AI no longer reflects what we actually want to measure.

Benchmarks have grown stale for a few main reasons. First, the industry learned to “teach to the test,” training AI models to achieve high scores rather than to genuinely improve. Second, data contamination is a widespread problem, meaning models may have seen the benchmark questions, or even the answers, in their training data. Lastly, many benchmarks have been maxed out. On popular tests such as SuperGLUE, models have already reached or exceeded 90% accuracy, so further gains are more statistical noise than meaningful improvement; past that point, scores no longer provide useful information. This is especially true in high-skill areas like coding and reasoning. Teams around the globe are now working to solve this evaluation crisis.
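At their core, most static benchmarks reduce to a simple loop: ask a fixed set of questions, compare answers, and report the fraction correct. Here is a minimal sketch of that scoring loop; the question set and the `model_answer` stub are hypothetical placeholders, not any real benchmark’s data or API:

```python
# Minimal sketch of static-benchmark scoring: a fixed question set,
# exact-match grading, and a single accuracy number.
# The questions and the model stub below are hypothetical placeholders.

BENCHMARK = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Boiling point of water at sea level (deg C)?", "answer": "100"},
]

def model_answer(question: str) -> str:
    # Stand-in for a real model call; this toy "model" always answers "4".
    return "4"

def score(benchmark, model) -> float:
    # Exact-match grading: one point per answer that matches verbatim.
    correct = sum(
        1 for item in benchmark
        if model(item["question"]).strip() == item["answer"]
    )
    return correct / len(benchmark)

print(f"accuracy: {score(BENCHMARK, model_answer):.0%}")  # prints "accuracy: 33%"
```

Because the questions never change, a model (or a training pipeline) that has seen them before can score highly without any underlying capability gain, which is exactly the contamination and teach-to-the-test problem described above.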

One result is a new benchmark named LiveCodeBench Pro. It draws problems from international algorithmic olympiads, competitions in which elite high school and university programmers solve challenging problems without external tools. The top AI models currently solve only 53% of medium-difficulty problems on their first attempt, and 0% of the hardest ones. These are tasks at which human experts excel.

Zihan Zheng, a junior at NYU and a North America finalist in competitive coding, led the team that developed LiveCodeBench Pro along with a group of olympiad medalists. They have published both the benchmark and a detailed analysis showing that top-tier models such as OpenAI’s o4-mini-high and Google’s Gemini 2.5 perform at a level comparable to the top 10% of human competitors. Zheng found a consistent pattern: AI excels at making plans and executing tasks, but struggles with nuanced algorithmic reasoning. “It shows AI is far from matching the most talented human coders,” says Zheng.

LiveCodeBench Pro could set a new ceiling. But what about the floor? Last month, researchers from several universities came together to argue that LLM agents should be evaluated on the basis of how risky they are, not just how well they perform. In real-world, application-driven environments, especially ones involving AI agents, unreliability, hallucinations, and brittleness are ruinous. When money or safety is at stake, one wrong move can spell disaster.

Other new attempts tackle the problem from different angles. Some benchmarks, such as ARC-AGI, now keep part of their data private to prevent AI models from being optimized excessively for the test, a problem known as “overfitting.” Yann LeCun, Meta’s chief AI scientist, has created LiveBench, a dynamic benchmark whose questions change every six months. The goal is to evaluate models not only on their knowledge but also on their adaptability.

Then there is Xbench, a Chinese benchmark developed by HongShan Capital Group (formerly Sequoia China), which I wrote about in a news story just a few days ago. Xbench was originally built in 2022, right after ChatGPT launched, as an internal tool for evaluating models for investment research. Over time, the team expanded the system and brought in external collaborators, and it made parts of its question set public only last week.

Xbench’s dual-track design is notable because it tries to bridge the gap between lab-based testing and real-world utility. The first track tests a model’s STEM skills and its ability to conduct Chinese-language research. The second track evaluates practical usefulness, or how well a model performs in areas such as recruitment and marketing. One task requires an agent to identify five qualified battery engineers; another asks it to match brands with relevant creators from a pool of more than 800.

Xbench’s team has big goals. They plan to expand their testing to sectors such as finance, law, and design, and to update the test sets quarterly to avoid stagnation.

I often wonder whether the best-scoring model is actually the best model for everyday use, because a model’s hardcore reasoning ability doesn’t necessarily translate into a creative, fun, and informative experience, and most queries from average users won’t be rocket science. There hasn’t been much research on how to evaluate a model’s creativity, but I would love to know which model is best for creative writing and art projects.

Human preference testing offers an alternative to benchmarks. LMArena, a platform that has become increasingly popular, lets users submit questions, compare responses from different models side by side, and choose which one they prefer. This method has its own flaws, though. Users sometimes reward answers that sound more agreeable or flattering, even when they are incorrect. That can encourage models to “sweet-talk” users and skew results in favor of pandering.

AI researchers are beginning to admit that the status quo in AI testing can’t continue. At the recent CVPR conference, NYU professor Saining Xie drew on James P. Carse’s book Finite and Infinite Games to criticize the hypercompetitive culture of AI research. An infinite game, he noted, is open-ended; the goal is to keep playing. In AI, however, a dominant player will often drop a major result, triggering a wave of follow-up papers on the same narrow topic. This race-to-publish culture puts researchers under enormous pressure and rewards speed rather than depth, short-term wins over long-term insight. He warned that if academia chooses to play a finite game, it will lose everything.

His framing was powerful, and it may apply to benchmarks too. Do we have a comprehensive scoreboard for measuring the quality of a model? Not really. Many dimensions, social, emotional, and interdisciplinary among them, still evade assessment. The wave of new benchmarks suggests a shift, and a healthy dose of skepticism may be necessary as the field develops.

This story was originally published in The Algorithm, our weekly AI newsletter. Sign up to receive stories like this first.

Correction: A previous version of this article incorrectly stated that o4-mini-high was the top-performing model in LiveCodeBench Pro.
