Big companies are hacking AI’s top leaderboard

Benchmarks play an outsized role in AI research, steering both progress and investment decisions. Chatbot Arena is the go-to leaderboard for comparing large language models, but recent research reveals systematic issues that have created an uneven playing field, potentially undermining scientific integrity and reliable model evaluation.

Overview of key insights about private testing, data access disparities, and model deprecation practices that contribute to unreliable leaderboard rankings. (All images are from the paper.)

How Chatbot Arena works and why it matters

Chatbot Arena has become the primary benchmark for evaluating generative AI models. Users submit prompts and judge which of two anonymous models performs better. This human-in-the-loop evaluation has gained prominence because it captures real-world usage better than static academic benchmarks.
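To picture the raw data this process produces, here is a minimal sketch of a battle record and a pairwise win tally; the record fields, model names, and prompts are illustrative assumptions, not the Arena's actual schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Battle:
    prompt: str
    model_a: str      # identities are hidden from the voter during the battle
    model_b: str
    winner: str       # "model_a", "model_b", or "tie"

# Hypothetical battle log; model names are placeholders, not real Arena data.
log = [
    Battle("Explain quicksort", "model-x", "model-y", "model_a"),
    Battle("Write a haiku", "model-x", "model-z", "model_b"),
    Battle("Summarize this email", "model-y", "model-z", "tie"),
]

# Tally pairwise outcomes; these counts are what a rating model consumes.
wins = Counter()
for b in log:
    if b.winner == "model_a":
        wins[(b.model_a, b.model_b)] += 1
    elif b.winner == "model_b":
        wins[(b.model_b, b.model_a)] += 1

print(wins)
```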

The researchers conducted a systematic review of Chatbot Arena, analyzing data from 2 million battles covering 243 models across 42 providers from January 2024 to April 2025. Their findings reveal how certain practices have distorted the leaderboard rankings.

Number of public models vs. maximum Arena score per provider. Marker size indicates total number of battles played. Higher scores correlate with higher exposure to the Arena through more models and battles.

The Arena uses the Bradley-Terry (BT) model to rank participants based on pairwise comparisons. Unlike simple win-rate calculations, BT accounts for opponent strength and provides statistically grounded rankings. However, this approach relies on key assumptions: unbiased sampling, transitivity of rankings (if A beats B and B beats C, then A should beat C), and a sufficiently connected comparison graph.
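To make the ranking mechanics concrete, here is a minimal sketch of fitting Bradley-Terry strengths to a toy battle log by maximum likelihood. This is not the Arena's actual code: the battle data, model count, and Elo-style scaling are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy battle log: (winner, loser) index pairs for three hypothetical models,
# arranged so every model has at least one win and one loss.
battles = [(0, 1), (0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (1, 2)]
n_models = 3

def neg_log_likelihood(strengths):
    # Bradley-Terry: P(i beats j) = exp(s_i) / (exp(s_i) + exp(s_j))
    nll = 0.0
    for winner, loser in battles:
        diff = strengths[winner] - strengths[loser]
        nll += np.log1p(np.exp(-diff))  # -log sigmoid(diff)
    return nll

# Pin the first model's strength to 0 for identifiability and fit the rest.
res = minimize(lambda s: neg_log_likelihood(np.concatenate(([0.0], s))),
               x0=np.zeros(n_models - 1), method="BFGS")
strengths = np.concatenate(([0.0], res.x))

# Map to an Elo-like scale for readability (the offset and scale are illustrative).
print("Arena-style scores:", 1000 + 400 / np.log(10) * strengths)
```

Because the fit conditions on who each model actually faced, a model's score reflects opponent strength rather than a raw win rate, which is why the assumptions about unbiased sampling and graph connectivity matter.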

The private testing advantage

One of the most concerning findings is an undisclosed policy that allows certain providers to test multiple model variants privately before public release, then selectively submit only their best-performing model to the leaderboard.

Number of privately-tested models per provider based on random battles from January-March 2025. Meta alone tested 27 anonymous models in March 2025.

The researchers discovered Meta tested 27 private variants before the Llama-4 release, while Google tested 10 variants. In contrast, smaller companies like Reka had only one active private variant, and academic labs tested no private models during this period.

This “best-of-N” submission strategy violates the BT model’s unbiased sampling assumption. In simulations, the researchers showed that testing just 10 variants yields roughly a 100-point increase in the best Arena score a provider can expect to identify.
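The mechanism is easy to reproduce with a quick simulation. The sketch below assumes a provider's private variants score around a hypothetical mean of 1200 Arena points with a 50-point spread; the distribution and its parameters are illustrative, meant to show how the expected maximum grows with N rather than to reproduce the paper's exact figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: variant scores ~ Normal(1200, 50); only the best
# of the N privately tested variants is ever submitted to the leaderboard.
true_mean, spread, trials = 1200.0, 50.0, 100_000

for n_variants in (1, 3, 10, 30):
    best = rng.normal(true_mean, spread, size=(trials, n_variants)).max(axis=1)
    gain = best.mean() - true_mean
    print(f"N={n_variants:>2}: expected best score ≈ {best.mean():.0f} (+{gain:.0f})")
```

The gain comes purely from selection over noisy draws: none of the variants is actually better on average, yet the submitted score climbs steadily with the number of private tests.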

Impact of the number of private variants tested on the best expected Arena score. Testing more variants increases the likelihood of selecting a model from the higher end of the performance distribution.

The researchers validated these simulation results through real-world experiments on Chatbot Arena. They submitted two identical checkpoints of Aya-Vision-8B and found they achieved different scores (1052 vs. 1069), with four models positioned between these identical variants on the leaderboard. When testing two different variants of Aya-Vision-32B, they observed even larger score differences (1059 vs. 1097), with nine models positioned between them.
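A back-of-the-envelope simulation shows how ordinary sampling noise alone can separate two identical checkpoints by double-digit score gaps of the kind observed. All figures here are hypothetical assumptions, including the battle count per model, which the article does not report.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two identical checkpoints, each with a true 50% win rate against the same pool.
# The battle count per model (3,000) is a hypothetical figure, not from the paper.
n_battles, trials = 3000, 100_000
p_a = rng.binomial(n_battles, 0.5, size=trials) / n_battles
p_b = rng.binomial(n_battles, 0.5, size=trials) / n_battles

# Map observed win rates to an Elo-style scale: 400 * log10(p / (1 - p)).
def elo(p):
    return 400 * np.log10(p / (1 - p))

gap = np.abs(elo(p_a) - elo(p_b))
print(f"Median score gap between identical models: {np.median(gap):.1f} points")
print(f"95th-percentile gap: {np.percentile(gap, 95):.1f} points")
```

Under these assumptions the gap between two identical models routinely reaches the mid-teens, which makes the observed 1052 vs. 1069 split unsurprising and underlines how fragile single-submission rankings can be.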

Extreme data access disparities

The research reveals substantial inequalities in access to data from Chatbot Arena battles, stemming from three main factors:
