Apple’s recent AI research paper, “The Illusion of Thinking,” has been making waves for its blunt conclusion: even the most advanced Large Reasoning Models (LRMs) collapse on complex tasks. But not everyone agrees with that framing.

Today, Alex Lawsen, a researcher at Open Philanthropy, published a detailed rebuttal arguing that many of Apple’s most headline-grabbing findings boil down to experimental design flaws, not fundamental reasoning limits.
The rebuttal also credits Anthropic’s Claude Opus model as its co-author.

The rebuttal: Less “illusion of thinking,” more “illusion of evaluation”

Lawsen’s critique, aptly titled “The Illusion of the Illusion of Thinking,” doesn’t deny that today’s LRMs struggle with complex planning puzzles. But he argues that Apple’s paper confuses practical output constraints and flawed evaluation setups with actual reasoning failure.
Here are the three main issues Lawsen raises:

1. Token budget limits were ignored in Apple’s interpretation: At the point where Apple claims models “collapse” on Tower of Hanoi puzzles with 8+ disks, models like Claude were already bumping up against their token output ceilings. (A full Tower of Hanoi solution for n disks requires 2^n - 1 moves, so the move list grows exponentially with disk count.) Lawsen points to real outputs where the models explicitly state: “The pattern continues, but I’ll stop here to save tokens.”

2. Impossible puzzles were counted as failures: Apple’s River Crossing test reportedly included unsolvable puzzle instances (for example, 6+ actor/agent pairs with a boat capacity that mathematically can’t transport everyone across the river under the given constraints). Lawsen calls attention to the fact that models were penalized for recognizing this and declining to solve them.

3. Evaluation scripts didn’t distinguish between reasoning failure and output truncation: Apple used automated pipelines that judged models solely by complete, enumerated move lists, even in cases where producing the full list would exceed the token limit. Lawsen argues that this rigid evaluation unfairly classified partial or strategic outputs as total failures.

Alternative testing: Let the model write code instead

To back up his point, Lawsen reran a subset of the Tower of Hanoi tests using a different format: asking models to generate a recursive Lua function that prints the solution instead of exhaustively listing all moves.
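To make the format change concrete, here’s a minimal sketch, purely for illustration, of the kind of recursive Lua solver Lawsen is describing (the peg labels and the move-count check are our own additions, not taken from his test):

```lua
-- Illustrative sketch only (not Lawsen's actual prompt or the models' output):
-- a compact recursive Tower of Hanoi solver of the kind models were asked to
-- write, instead of enumerating every move in prose.
local moves = 0

local function hanoi(n, from, to, via)
  if n == 0 then return end
  hanoi(n - 1, from, via, to)    -- park the top n-1 disks on the spare peg
  print(("Move disk %d from %s to %s"):format(n, from, to))
  moves = moves + 1
  hanoi(n - 1, via, to, from)    -- stack them back on top of disk n
end

-- A 15-disk instance requires 2^15 - 1 = 32,767 moves: trivial for the
-- program to print, but far too long for a model to list within its
-- output token budget.
hanoi(15, "A", "C", "B")
assert(moves == 2^15 - 1)
```

Because correctness can be checked by running the program, the evaluation no longer hinges on whether the model can type out all 32,767 moves before hitting its token ceiling.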
The result? Models like Claude, Gemini, and OpenAI’s o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success. Lawsen’s conclusion: when you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks, at least in terms of algorithm generation.
Why this debate matters

At first glance, this might sound like typical AI research nitpicking. But the stakes here are bigger than that. The Apple paper has been widely cited as proof that today’s LLMs fundamentally lack scalable reasoning ability, which, as I argued here, might not have been the fairest way to frame the study in the first place.
Lawsen’s rebuttal suggests the truth may be more nuanced: yes, LLMs struggle with long-form token enumeration under current deployment constraints, but their reasoning engines may not be as brittle as the original paper implies, or at least not as brittle as many readers took it to imply. Of course, none of this lets LRMs off the hook.
Even Lawsen acknowledges that true algorithmic generalization remains a challenge, and his re-tests are still preliminary. He also lays out suggestions for what future work on the subject should focus on. In other words, his core point is clear: before we declare reasoning dead on arrival, it might be worth double-checking the standards by which it is being measured.