
Author | Guo Xiaojing, Tencent Technology
Editor | Xu Qingyang
Top-tier AI models in the world can pass medical licensing exams, write complex code, and even outperform human experts in math competitions. However, they repeatedly struggle in the children's game Pokémon.
This attention-grabbing attempt began in February 2025, when a researcher from Anthropic launched a Twitch live stream titled "Claude Playing Pokémon Red," coinciding with the release of Claude Sonnet 3.7.
2,000 viewers flooded into the live stream. In the public chat area, the audience offered advice and encouragement to Claude, gradually transforming the live stream into a public demonstration and observation of AI capabilities.
Claude Sonnet 3.7 could be said to "know how to play" Pokémon, but knowing how to play is not the same as being able to win. It could get stuck at key points for dozens of hours, and it made elementary mistakes that even a child player would not.
This is not Claude's first attempt.
Earlier versions performed even more disastrously: some wandered aimlessly on the map, some got stuck in infinite loops, and many more couldn't even leave the starting village.
Even with significantly enhanced capabilities, Claude Opus 4.5 can still make baffling mistakes. Once, it circled outside the "gym" for four full days without ever entering, simply because it failed to realize that it needed to chop down a tree blocking the entrance.
Why did a children's game become a Waterloo for AI?
Because Pokémon requires precisely the abilities that today's AI most lacks: continuous reasoning in an open world without explicit instructions, remembering decisions made several hours ago, understanding implied causal relationships, and making long-term plans among hundreds of possible actions.
These tasks are effortless for an 8-year-old child, yet they represent an insurmountable chasm for AI models that claim to "surpass humans."
01. Does the Gap in Toolkits Determine Success or Failure?
By comparison, Google's Gemini 2.5 Pro successfully completed a similarly challenging Pokémon game in May 2025. Google CEO Sundar Pichai even half-jokingly remarked in a public setting that the company had taken a step toward creating "artificial Pokémon intelligence."
However, this result cannot be simply attributed to the Gemini model itself being "smarter."
The key difference lies in the set of tools the model uses. Joel Zhang, the independent developer who ran the Gemini Pokémon live stream, likens the toolset to an "Iron Man suit": the AI does not enter the game bare-handed, but is placed within a system that can call upon a variety of external capabilities.
Gemini's toolset offers more support, such as transcribing gameplay visuals into text, which helps compensate for the model's weaknesses in visual understanding and provides customized puzzle-solving and path-planning tools. In contrast, Claude's toolset is more minimalistic, and its attempts more directly reflect the model's actual capabilities in perception, reasoning, and execution.
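The "Iron Man suit" idea can be sketched as a small registry of callable tools that sit between the model and the game, compensating for the model's weak points. Everything below is illustrative: the names (`GameState`, `transcribe_screen`, `plan_path`, `Toolset`) are hypothetical and do not correspond to any real Anthropic or Google API.

```python
# Hypothetical sketch of a game-agent "toolset": the model never touches the
# game directly; it calls named tools such as a screen-to-text transcriber
# and a path planner. All names here are illustrative, not a real API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class GameState:
    screen_pixels: bytes = b""
    player_pos: Tuple[int, int] = (0, 0)


def transcribe_screen(state: GameState) -> str:
    """Turn raw pixels into a text description the language model can read."""
    return f"player at {state.player_pos}; screen size {len(state.screen_pixels)} bytes"


def plan_path(state: GameState, goal: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Toy path planner: step one axis at a time toward the goal tile."""
    x, y = state.player_pos
    gx, gy = goal
    steps = []
    while (x, y) != (gx, gy):
        if x != gx:
            x += 1 if gx > x else -1
        else:
            y += 1 if gy > y else -1
        steps.append((x, y))
    return steps


@dataclass
class Toolset:
    tools: Dict[str, Callable] = field(default_factory=dict)

    def register(self, name: str, fn: Callable) -> None:
        self.tools[name] = fn

    def call(self, name: str, *args):
        return self.tools[name](*args)


suit = Toolset()  # the "Iron Man suit": external capabilities the model can invoke
suit.register("transcribe", transcribe_screen)
suit.register("plan_path", plan_path)
```

A richer suit, like Gemini's, registers more tools; a minimal one, like Claude's, leaves more of the burden on the model itself.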
Such differences are not obvious in daily tasks.
When a user makes a request to a chatbot that requires an internet search, the model automatically invokes the search tool. However, in long-term tasks like "Pokémon," differences in the available toolset become significant enough to determine success or failure.
02. Turn-Based Play Exposes AI's "Long-Term Memory" Shortcomings
Because Pokémon employs a strict turn-based system and does not require real-time reactions, it has become an excellent "training ground" for testing AI. In each step, the AI only needs to reason based on the current screen, goal prompts, and available actions to output clear commands such as "press the A button."
This seems to be exactly the type of interaction that large language models are most proficient in.
The crux of the problem lies in "discontinuity" along the temporal dimension. Although Claude Opus 4.5 has operated for over 500 hours and executed roughly 170,000 steps, the model is effectively reinitialized after each operation, forced to search for clues within an extremely narrow context window. This mechanism makes it resemble an amnesiac relying on sticky notes to maintain cognition: it endlessly cycles through fragments of information and never achieves the qualitative leap in experience that a human player would gain through sheer accumulation.
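The "amnesiac with sticky notes" loop can be sketched as follows: at every step the agent rebuilds its prompt from the current screen, the goal, and a short notes string, which is the only thing that survives between steps. The model call is stubbed out with a trivial rule; all names and the notes budget are illustrative assumptions, not the actual harness.

```python
# Illustrative agent step loop: the model is re-initialized every step, so its
# only persistent memory is a short "sticky notes" string it rewrites itself.
# choose_action is a stub standing in for a real language-model call.
MAX_NOTES_CHARS = 500  # assumed narrow budget for persistent memory


def build_prompt(screen_text: str, goal: str, notes: str) -> str:
    """Rebuild the full context from scratch, every single step."""
    return (
        f"GOAL: {goal}\n"
        f"NOTES: {notes}\n"
        f"SCREEN: {screen_text}\n"
        f"Actions: A, B, UP, DOWN, LEFT, RIGHT"
    )


def choose_action(prompt: str) -> tuple:
    """Stub for the model call: picks a button and rewrites its own notes."""
    button = "A" if "door" in prompt else "UP"
    notes = "last step: pressed " + button
    return button, notes


def run_steps(screens, goal):
    """Run one button press per screen; only `notes` carries state forward."""
    notes = ""
    pressed = []
    for screen_text in screens:
        prompt = build_prompt(screen_text, goal, notes)
        button, notes = choose_action(prompt)
        notes = notes[:MAX_NOTES_CHARS]  # anything beyond the budget is forgotten
        pressed.append(button)
    return pressed
```

Because everything outside `notes` is discarded each step, a decision made hours ago survives only if the agent happened to write it down, which is exactly the failure mode the live streams exposed.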
In domains such as chess and Go, AI systems have long surpassed humans, but those systems are highly customized for a single task. In contrast, general-purpose models like Gemini, Claude, and GPT routinely outperform humans in exams and programming competitions, yet they repeatedly struggle in a game aimed at children.
This contrast itself is highly enlightening.
According to Joel Zhang, the core challenge facing AI lies in its inability to consistently pursue a single, clear objective over a long time span. "If you want an agent to accomplish real work, it can't forget what it did five minutes ago," he pointed out.
And this capability is precisely an essential prerequisite for achieving automation in cognitive labor.
Peter Whidden, an independent researcher, provided a more intuitive explanation. He once open-sourced a traditional AI-based Pokémon algorithm. "The AI knows almost everything about Pokémon," he said. "It's trained on massive amounts of human data and clearly knows the correct answers. But when it comes to the execution phase, it becomes awkward and clumsy."
In the game, this "knowing but being unable to act" gap is constantly magnified: the model may know it needs to find a certain item, but is unable to stably locate it on a 2D map; it may know it should talk to an NPC, but repeatedly fails in pixel-level movement.
03. Behind the Evolution of Capabilities: The "Instinct" Gap That Remains Unbridged
Nevertheless, the progress in AI is evident. Claude Opus 4.5 clearly outperforms its predecessor in self-recording and visual understanding, allowing it to advance further in the game. Gemini 3 Pro, after completing "Pokémon Blue," successfully finished the more challenging "Pokémon Crystal," without losing a single battle throughout the entire game. This is a feat that Gemini 2.5 Pro had never achieved.
Meanwhile, Anthropic's Claude Code toolset allows the model to write and run its own code, which has already been used for retro games such as RollerCoaster Tycoon, reportedly successfully managing a virtual theme park.
These cases reveal a non-intuitive reality: AI equipped with the right toolset can demonstrate extremely high efficiency in knowledge-based tasks such as software development, accounting, and legal analysis, even though they still struggle with tasks requiring real-time responses.
The Pokémon experiment also revealed another intriguing phenomenon: models trained on human data exhibit behavior patterns similar to those of humans.
In the technical report on Gemini 2.5 Pro, Google noted that the model's reasoning quality significantly decreases when the system simulates a "panic state," such as when a Pokémon is about to faint.
And when Gemini 3 Pro finally completed "Pokémon Blue," it left itself a non-essential note: "To end poetically, I will return to my original home and have one last conversation with my mother, retiring the character."
In Joel Zhang's view, this behavior was unexpected and carried a certain projection of human-like emotions.
04. The "Digital Long March" That AI Struggles to Overcome Goes Far Beyond Pokémon
"Pokémon" is not an isolated example. On the path toward achieving artificial general intelligence (AGI), developers have found that even if AI can rank highly in bar exams, it still faces insurmountable "Waterloos" when dealing with several types of complex games.
NetHack: The Abyss of Rules

This 1980s dungeon game is a "nightmare" for AI research, featuring extreme randomness and a "permadeath" mechanic. Facebook AI Research found that even models that can write code perform far worse than human beginners at NetHack, which demands common-sense logic and long-term planning.
Minecraft: The Vanishing Sense of Purpose

Although AI can already craft wooden pickaxes and even mine for diamonds, independently "defeating the Ender Dragon" remains a fantasy. In open worlds, AI often "forgets" its original goal during resource-gathering processes that span dozens of hours, or gets completely lost in complex navigation.
StarCraft II: The Gap Between Versatility and Specialization

Although customized models have previously defeated professional players, if Claude or Gemini were to control the game directly from visual input, they would fall apart immediately. General models still struggle to handle the uncertainty of the "fog of war" and to balance micro-management with macro-level construction.
RollerCoaster Tycoon: The Imbalance of Micro and Macro Management

Managing a theme park requires tracking the status of thousands of visitors. Even Claude Code, with its initial management capabilities, can easily become overwhelmed when dealing with large-scale financial collapses or sudden accidents. Any single failure in reasoning could lead to the park's bankruptcy.
Elden Ring vs. Sengoku: The Chasm in Physics Feedback

This type of game, built on intensive action feedback, is extremely unfriendly to AI. With current visual-processing latency, by the time the AI has "thought through" a boss's move, the character is often already dead. The demand for millisecond-level responses is fundamentally at odds with the model's turn-by-turn interactive logic.
05. Why Has Pokémon Become an AI Benchmark?
Nowadays, Pokémon is gradually becoming an informal yet highly persuasive benchmark in the field of AI evaluation.
Models from Anthropic, OpenAI, and Google have collectively attracted millions of comments on Twitch through related live streams. Google has detailed Gemini's gaming progress in technical reports, and Sundar Pichai publicly mentioned this achievement at the I/O Developer Conference. Anthropic has even set up a "Claude Playing Pokémon" demonstration area at industry conferences.
"We are a group of super tech enthusiasts," admitted David Hershey, head of Applied AI at Anthropic. But he emphasized that it's not just for entertainment.
Unlike traditional benchmarks that involve one-time, question-answering interactions, Pokémon can track a model's reasoning, decision-making, and progress toward goals over an extended period, which is closer to the complex tasks that humans hope AI can perform in the real world.
So far, AI's challenges in "Pokémon" continue. Yet it is precisely these recurring difficulties that clearly outline the capability boundaries that general artificial intelligence has yet to cross.
Contributing editor Wuji also contributed to this article.
