New paper pushes against Apple’s LLM “reasoning collapse” study

Apple’s recent AI research paper, The Illusion of Thinking, has made waves for its blunt conclusion that even the most advanced Large Reasoning Models (LRMs) collapse when faced with complex tasks. Not everyone agrees.

Today, Alex Lawsen published a detailed rebuttal arguing that Apple’s most eye-catching findings come down to experimental design flaws, not fundamental reasoning limits. The paper credits Anthropic’s Claude Opus model as its co-author.

The rebuttal: less “illusion of thinking,” more “illusion of evaluation”

Lawsen’s critique, aptly titled “The Illusion of the Illusion of Thinking,” doesn’t deny that today’s LRMs struggle with complex planning puzzles. Instead, he argues that Apple’s paper confuses practical output constraints and flawed evaluation setups with actual reasoning failure. His three main objections:

  • Apple’s interpretation ignored token budget limits:
    By the time Apple claimed models “collapsed” on Tower of Hanoi with 8+ disks, models like Claude had already reached their token output ceilings; a full solution for n disks requires 2^n − 1 enumerated moves, so these listings grow exponentially long. Lawsen cites real outputs in which the models explicitly state, “The pattern continues but I’ll save tokens here.”

  • Unsolvable puzzles were counted as failures:
    Apple’s River Crossing test included unsolvable instances (for example, 6+ actor/agent pairs with a boat that mathematically cannot carry everyone across the river under the given constraints). Lawsen points out that models were penalized for declining to solve puzzles that cannot be solved.
  • Evaluation scripts did not distinguish between reasoning failures and output truncation:
    Apple’s automated pipelines graded models on complete, enumerated move lists, even when the full solution would not fit within the token limit. Lawsen argues this rigid scoring misclassified partial or deliberately truncated outputs as failures.
Alternative testing: let models write code instead

To support his point, Lawsen reran a subset of the Tower of Hanoi tests in a different format: instead of requiring an exhaustive list of moves, he asked models to generate a recursive Lua function that prints the solution (a sketch of that kind of function follows below).

The result? Models such as Claude, Gemini, and OpenAI’s o3 produced algorithmically correct solutions to the 15-disk Hanoi problem, far beyond the complexity at which Apple reported zero success.

Lawsen’s conclusion: when artificial output constraints are removed, LRMs seem perfectly capable of reasoning about high-complexity problems, at least when it comes to generating the algorithm.
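
For a concrete sense of what that alternative format looks like, here is a minimal sketch of the kind of recursive Lua solver involved; the function name, move wording, and the 15-disk call are illustrative assumptions, not Lawsen’s exact prompt or any model’s verbatim output.

    -- Minimal sketch (assumed form): a recursive Tower of Hanoi solver that
    -- prints every move, so the program does the enumeration instead of the
    -- model spelling out all 2^n - 1 moves token by token.
    local function hanoi(n, from, to, via)
      if n == 0 then return end
      hanoi(n - 1, from, via, to)   -- clear the top n-1 disks onto the spare peg
      print(string.format("move disk %d: %s -> %s", n, from, to))
      hanoi(n - 1, via, to, from)   -- restack those n-1 disks onto the disk just moved
    end

    hanoi(15, "A", "C", "B")        -- 15 disks: 32,767 moves, trivial for the program to print

Writing a dozen lines like these costs a few hundred tokens, while enumerating all 32,767 moves would not fit in most models’ output windows, which is essentially the gap Lawsen argues Apple’s scoring ignored.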

Why this debate is important

At first, this might sound like typical AI research nitpicking, but the stakes are much higher. Apple’s paper has been widely cited as proof that today’s LLMs fundamentally lack scalable reasoning ability. However, as I argue here, that may not have been a fair way to frame the study.

Lawsen’s rebuttal suggests the truth is more nuanced: yes, LLMs struggle with long, exhaustive token output under current deployment constraints, but their reasoning engines may not be as fragile as the original paper implied, or at least not as fragile as many readers took it to imply.

Of course, none of this absolves LRMs. Even Lawsen admits that algorithmic generalization remains a challenge and that his retests are preliminary. He also offers suggestions for future work, chief among them: use complexity metrics that reflect a puzzle’s computational difficulty, not just the length of its solution.

The question, as Lawsen frames it, is not whether LRMs are able to reason, but whether our evaluations are able to distinguish reasoning from typing.

His main point: before we declare reasoning dead, it’s worth double-checking the standards by which it is measured.

H/T: Fabricio Carraro.
