
Earlier this week, researchers at Apple released a damning paper, criticizing the AI industry for vastly overstating the ability of its top AI models to reason or "think."
The team found that the models, including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini, were stumped by even the simplest of puzzles. For instance, the "large reasoning models," or LRMs, consistently failed at Tower of Hanoi, a children's puzzle game in which a stack of differently sized disks must be moved between three pegs, one disk at a time, without ever placing a larger disk on a smaller one.
The researchers found that the AI models' accuracy in the game fell below 80 percent with seven disks, and that they were more or less entirely stumped by puzzles involving eight.
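To get a sense of the scale involved, here is a minimal sketch of the classic recursive solution to the puzzle, written in Python as an illustration only (it is not code from the Apple paper, and the function and peg names are our own). A perfect solution for n disks takes 2^n - 1 moves, so seven disks already demand 127 correct moves in a row, and eight demand 255.

```python
# Illustrative sketch, not the Apple researchers' code: the textbook
# recursive Tower of Hanoi solution, used here only to show how the
# number of required moves doubles with each added disk.

def hanoi(n, source, target, spare, moves):
    """Move n disks from source to target, using spare as a buffer."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the smaller disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller disks on top of it

for n in (7, 8):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(f"{n} disks: {len(moves)} moves")      # prints 127, then 255
```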
They also consistently failed at Blocks World, a block-stacking puzzle, and River Crossing, a puzzle that involves ferrying items across a river by boat under several constraints.
"Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," the Apple researchers wrote.
It was an eyebrow-raising finding, highlighting how even the most sophisticated AI models still fail to logic their way through simple puzzles, despite their makers' breathless marketing making them out to be something far more capable.
Those approaches to selling the tech to the public have encouraged users to anthropomorphize AI models, or think of them as humanlike, opening a wide gap between their presumed and actual capabilities.
The findings amplify ongoing fears that current AI approaches, including