New study shows that simulated reasoning AI models are not yet living up to their claims

A screenshot from the AoPSOnline site showing the 2025 USAMO Problem 1 and its solution. Credit: AoPSOnline

The USAMO serves as a qualifying test for the International Math Olympiad, and it sets a much higher bar than tests such as the American Invitational Mathematics Examination (AIME). While AIME problems can be very difficult, they require only integer answers. The USAMO instead requires contestants to write complete mathematical proofs, which are scored on correctness, completeness, and clarity over nine hours of work spread across two days.

Researchers evaluated several AI reasoning models on the six problems from the 2025 USAMO shortly after the problems' release, minimizing the chance that the problems had appeared in the models' training data. The models included Qwen's QwQ-32B, DeepSeek's R1, xAI's Grok 3, Google's Gemini 2.0 Flash Thinking (Experimental) and Gemini 2.5 Pro, OpenAI's o1-pro and o3-mini-high, and Anthropic's Claude 3.7 Sonnet with Extended Thinking.

Credit: MathArena

Although one model, Google's Gemini 2.5 Pro, achieved a notably higher average of 10.1 out of 42 points (24%), the overall results represented a massive drop in performance compared to AIME benchmarks. The other models lagged far behind: DeepSeek's R1 scored 2.0 points, xAI's Grok 3 scored 1.8 points, Google's Flash Thinking scored 1.8 points, Anthropic's Claude 3.7 Sonnet achieved 1.5 points, and Qwen's QwQ and OpenAI's o1-pro both averaged 1.2 points. OpenAI's o3-mini had the lowest average at only 0.9 points (2.1%). Of the 200 solutions generated across all models and runs tested, not one received a perfect score.

Although OpenAI's newly released o3 and o4-mini-high were not examined in this study, benchmarks on the researchers' MathArena website show o3-high scoring 21.73 percent and o4-mini-high scoring 19.05 percent on the USAMO. However, those results may be contaminated: they were measured after the contest took place, which means the newer OpenAI models could have included the contest's solutions in their training data.

Why the models failed

The researchers identified several key failure patterns in the paper. The AI outputs contained logical gaps where mathematical justification was missing, included arguments based on unproven assumptions, and continued pursuing incorrect approaches despite having already generated contradictory results.

One specific example involved USAMO 2025 Problem 5. The problem asked models to find all positive whole numbers k such that a certain sum of binomial coefficients raised to the power k always results in an integer. Qwen's QwQ made a critical error on this problem: it incorrectly excluded non-integer possibilities at a step where the problem statement allowed for them. That error led the model to an incorrect final answer, despite it having correctly identified the necessary conditions earlier in its reasoning process.
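For readers curious about the arithmetic involved, here is a minimal brute-force sketch in Python. It assumes the commonly published statement of the problem (determine all positive integers k such that the sum of the binomial coefficients C(n, i), each raised to the power k, is divisible by n + 1 for every positive integer n) and merely spot-checks small cases; it illustrates the condition but is no substitute for the full proof contestants must write.

```python
from math import comb

def holds_up_to(k: int, max_n: int = 50) -> bool:
    """Spot-check whether sum_{i=0..n} C(n, i)**k is divisible by n + 1
    for every n up to max_n. Brute-force illustration only; the actual
    problem demands a proof covering all positive integers n."""
    return all(
        sum(comb(n, i) ** k for i in range(n + 1)) % (n + 1) == 0
        for n in range(1, max_n + 1)
    )

# Odd exponents fail quickly (e.g., k = 1 gives 2**n, and 4 % 3 != 0 at n = 2),
# while even exponents pass every small case checked here.
for k in range(1, 9):
    print(f"k = {k}: {holds_up_to(k)}")
```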
