ORCA shows that AI is bad at math

Assessing the Mathematical Capabilities of Large Language Models

In George Orwell’s dystopian classic 1984, the phrase “two plus two equals five” symbolizes distorted truth and blind allegiance. Today, despite their advanced language skills, contemporary large language models (LLMs) often struggle with basic arithmetic accuracy.

Why AI Struggles with Math Despite Language Prowess

Although AI systems are trained to provide correct answers and can recognize cultural references, such as the Orwellian “2 + 2 = 5” as a symbol of ideological control, they frequently falter when performing reliable calculations. This inconsistency highlights a gap between natural language understanding and precise mathematical reasoning.

Introducing ORCA: A New Benchmark for AI Mathematical Reasoning

To better evaluate AI’s computational skills, researchers from Omni Calculator in Poland, alongside academic collaborators in France and Germany, developed the ORCA (Omni Research on Calculation for AI) benchmark. This test challenges AI models with diverse math problems expressed in natural language, spanning technical and scientific disciplines.

Performance of Leading AI Models on ORCA

When tested, several prominent LLMs (including ChatGPT-5, Gemini 2.5 Flash, Grok 4, DeepSeek V3.2, and Claude Sonnet 4.5) demonstrated limited success. Gemini 2.5 Flash led with an accuracy of 63%, closely followed by Grok 4 at 62.8%, while others scored significantly lower. Notably, even the most advanced models failed nearly half of the deterministic reasoning tasks presented.

Limitations of Existing Math Benchmarks

Other popular benchmarks like GSM8K and MATH-500 have reported high AI scores, sometimes exceeding 95%. However, these results can be misleading. Many datasets used in these tests have been incorporated into the training data of AI models, akin to students having prior access to exam answers. This overlap inflates performance metrics and does not accurately reflect true reasoning ability.

Insights from Recent Research

According to the ORCA research team (Claudia Herambourg, Dawid Siuda, Julia Kopczynska, Wojciech Sassi, and Joanna Smietanska Nowak), while models like OpenAI’s GPT-4 perform well on traditional benchmarks, they still exhibit frequent errors in logic and arithmetic. Supporting this, data from Oxford University’s Our World in Data project (April 2024) shows AI math reasoning scores averaging -7.44 relative to a human baseline of zero, underscoring the ongoing challenges.

Variability in AI Math Performance Across Domains

Performance inconsistencies are evident across different scientific fields. For example, DeepSeek V3.2 excelled in Math and Unit Conversions with a 74.1% success rate but struggled in Biology and Chemistry (10.5%) and Physics (31.3%). Claude Sonnet 4.5 consistently underperformed, never surpassing 65% accuracy in any category.

Example Problem Highlighting AI Challenges

Consider this engineering problem from the ORCA benchmark:

Prompt: You have 7 blue LEDs (3.6V) connected in parallel with a resistor, powered by 12V and a current of 5 mA. What is the power dissipation in the resistor (in mW)?
Expected answer: 42 mW
Claude Sonnet 4.5’s response: 294 mW (incorrect), though the model also offered 42 mW as an alternative.

The model’s confusion stemmed from ambiguity about whether the 5 mA current applied per LED or to the entire circuit, illustrating how AI can misinterpret problem context and produce divergent answers.
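The two answers above follow directly from the two readings of the prompt. A minimal arithmetic sketch (assuming the standard circuit interpretation: the parallel LED bank in series with one resistor, so the resistor drops 12 V − 3.6 V = 8.4 V and dissipates P = V × I):

```python
# Power dissipated by the series resistor under the two readings of
# the ORCA prompt: is the 5 mA figure the total circuit current, or
# the current through each of the 7 LEDs?

V_SUPPLY = 12.0   # supply voltage, volts
V_LED = 3.6       # LED forward voltage, volts
I_STATED = 0.005  # the 5 mA figure from the prompt, amps
N_LEDS = 7

# Voltage across the resistor: supply minus the LED forward drop.
v_resistor = V_SUPPLY - V_LED  # 8.4 V

# Reading 1: 5 mA is the total circuit current (the intended reading).
p_total_mw = v_resistor * I_STATED * 1000

# Reading 2: 5 mA flows through each LED, giving 35 mA in total
# (the reading behind the model's 294 mW answer).
p_per_led_mw = v_resistor * I_STATED * N_LEDS * 1000

print(round(p_total_mw, 1), "mW")    # intended answer: 42 mW
print(round(p_per_led_mw, 1), "mW")  # per-LED reading: 294 mW
```

Both values are arithmetically correct for their respective interpretations; the benchmark scores only the first, which is why resolving the ambiguity matters.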

Conclusion: The Road Ahead for AI Mathematical Reasoning

Current AI benchmarks reveal that natural language proficiency does not guarantee dependable computational reasoning. As AI models continue to evolve, developing rigorous, unbiased evaluation tools like ORCA is essential to accurately measure and improve their mathematical capabilities. Until then, users should remain cautious about relying on AI for precise calculations.

Keywords: AI math reasoning, large language models, ORCA benchmark, AI arithmetic accuracy, AI computational reliability
