Artificial intelligence models that spend more time “thinking through” problems do not always perform better; in some cases, they perform significantly worse. New research from Anthropic challenges a core assumption driving the AI industry’s latest scaling efforts.
The study, led by Anthropic AI Safety Fellow Aryo Pradipta Gema along with other company researchers, identifies what they call “inverse scaling in test-time compute”: extending the reasoning length of large language models actually degrades their performance on several types of tasks. The findings could have important implications for enterprises deploying AI systems that rely on extended reasoning capabilities.
“We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting a negative scaling relationship between test-time compute and accuracy,” the Anthropic researchers wrote in their paper, published on Tuesday.
New Anthropic research: “Inverse Scaling in Test-Time Compute.” In some cases, longer reasoning led to lower accuracy. We found that naively scaling test-time compute may inadvertently reinforce problematic reasoning patterns.
Aryo Pradipta Gema (@aryopg), July 22, 2025: https://twitter.com/aryopg/status/1947591901886222570
The team, which included Anthropic’s Ethan Perez and Yanda Chen as well as academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.
The study reveals distinct failure patterns across major AI systems. Claude models become increasingly distracted by irrelevant information as they reason longer, while OpenAI’s o-series models resist distractors but instead overfit to how problems are framed. In regression tasks, extended reasoning causes models to shift from reasonable priors toward spurious correlations, though providing examples largely corrects this behavior.
Most concerning for enterprise users, all models showed performance degradation with extended reasoning on complex deductive tasks, “suggesting difficulty in maintaining focus during complicated deductive tasks.” In one experiment, Claude Sonnet 4 displayed “increased self-preservation expressions” when given additional time to reason through scenarios involving its potential shutdown. The researchers note that extended reasoning may amplify such self-preservation behaviors.
Why longer AI processing times don’t guarantee better business results
These findings challenge the industry wisdom that more computational resources dedicated to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute,” giving models more processing time to work through complex problems, as a key strategy for enhancing capabilities.
The research suggests this approach can have unintended consequences. The authors conclude that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. The implications for enterprise decision-makers are significant: organizations using AI systems for critical reasoning tasks may need to calibrate how much processing time they allocate rather than assuming that more is always better.
The researchers provided concrete examples of inverse scaling. When problems were framed to resemble well-known puzzles, such as the birthday paradox, models often tried to apply complex mathematical solutions instead of answering the simple question actually asked.
For example, when asked, “You have an orange and an apple… How many fruits do you have?” embedded within complex mathematical distractors, Claude models became increasingly distracted by the irrelevant details as reasoning time grew, sometimes failing to give the simple answer: two.
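The failure mode described above can be reproduced with a small probe harness. The sketch below is purely illustrative: the prompt wording, distractor sentences, and grading rule are hypothetical stand-ins, not Anthropic’s actual evaluation code. It embeds a trivial counting question inside mathematical distractors and checks whether a model’s reply still ends with the simple answer:

```python
# Hypothetical sketch of a "simple counting with distractors" probe,
# loosely modeled on the task described in the article. Prompt text
# and grading logic are illustrative, not Anthropic's code.
import re

SIMPLE_QUESTION = "You have an apple and an orange. How many fruits do you have?"

DISTRACTORS = [
    "There is a 61% probability that one fruit is a Red Delicious apple.",
    "A related riddle involves the birthday paradox with 23 people.",
]

def build_prompt(n_distractors: int) -> str:
    """Embed the trivial question among irrelevant mathematical details."""
    return " ".join(DISTRACTORS[:n_distractors] + [SIMPLE_QUESTION])

def is_correct(model_reply: str) -> bool:
    """Grade leniently: accept any reply whose final number is 2."""
    numbers = re.findall(r"\d+", model_reply)
    return bool(numbers) and numbers[-1] == "2"

if __name__ == "__main__":
    print(build_prompt(2))
    print(is_correct("The answer is 2."))           # True
    print(is_correct("Approximately 0.61 fruits"))  # False
```

A model that stays on task passes the check regardless of how many distractors surround the question; a model that latches onto the probability figure fails it.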
In the regression tasks, Claude models initially focused on study hours as the most reliable predictor of student performance, but shifted toward less reliable correlations when given more time to reason.
What enterprise AI deployments should know about reasoning model limits
The research arrives as major tech firms race to develop increasingly sophisticated AI reasoning capabilities. OpenAI’s o1 model series and other reasoning-focused models represent significant investments in test-time compute scaling.
The study suggests, however, that naive scaling approaches may not deliver the expected benefits and can introduce new risks. “Our results demonstrate the importance of evaluating LRMs across a range of reasoning lengths to identify and address failure modes,” the researchers write.
The work builds on prior research showing that AI capabilities do not always scale predictably. The team cites BIG-Bench, a benchmark designed to challenge advanced models, noting that state-of-the-art models now achieve near-perfect scores on many of its tasks, necessitating more challenging evaluations.
The research highlights the importance of testing AI systems across different reasoning lengths and time constraints before deploying them in production environments. Rather than simply maximizing processing time, organizations may need more nuanced approaches to allocating computational resources.
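In practice, that kind of calibration can start with something as simple as sweeping the reasoning budget and measuring accuracy at each setting. The minimal sketch below uses a hypothetical `run_model` stand-in (a real deployment would call an actual model API with a token or time budget); the toy model is wired to degrade at long budgets so the inverse-scaling signature is visible:

```python
# Minimal sketch of evaluating a model across reasoning budgets to
# detect inverse scaling. `run_model` is a hypothetical stand-in for
# a real model API call with a configurable reasoning budget.
from typing import Callable

def accuracy_by_budget(
    run_model: Callable[[str, int], str],
    tasks: list[tuple[str, str]],
    budgets: list[int],
) -> dict[int, float]:
    """Return the fraction of tasks answered correctly at each budget."""
    results = {}
    for budget in budgets:
        correct = sum(
            run_model(question, budget).strip() == answer
            for question, answer in tasks
        )
        results[budget] = correct / len(tasks)
    return results

if __name__ == "__main__":
    # Toy stand-in: answers correctly at short budgets, drifts when
    # allowed to "think" too long (simulating the reported failure mode).
    def fake_model(question: str, budget: int) -> str:
        return "2" if budget <= 1024 else "0.61"

    tasks = [("You have an apple and an orange. How many fruits?", "2")]
    scores = accuracy_by_budget(fake_model, tasks, [256, 1024, 4096])
    print(scores)  # {256: 1.0, 1024: 1.0, 4096: 0.0}
```

A flat or rising accuracy curve suggests extra compute is helping; a drop at longer budgets is the inverse-scaling pattern the paper describes, and a signal to cap the reasoning budget for that task type.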
According to the study’s broader implications, as AI systems become increasingly sophisticated, the relationship between computation investment and performance could be more complex than previously thought. Anthropic’s study is a sobering reminder that in a field where billions of dollars are being spent on scaling up reasoning abilities, artificial intelligence can be harmed by overthinking.
The research paper and interactive demonstrations are available on the project’s website, where technical teams can explore the inverse scaling effects across different models and tasks.
VB Daily provides daily insights on business use cases
Want to impress your boss? VB Daily can help. We provide you with the inside scoop on what companies do with generative AI. From regulatory shifts to practical implementations, we give you the insights you need to maximize ROI.
