The article discusses the development pathways of world models in the field of embodied intelligence. Currently, there are two prevailing approaches: Silicon Valley’s “replacement camp,” which seeks to completely replace VLA with WAM, and China’s dominant “integration camp,” which treats world models as a complementary capability to VLA. The article identifies three bubbles facing world models: overgeneralized definitions, high computational barriers, and difficulties in practical deployment. It argues that true world models should be embedded within real closed-loop business operations to enable machines to act in the physical world, rather than merely pursuing photorealistic image generation.
Author and source: A Priori Lab
From VLA to WAM: an overhyped revolution and an underestimated evolution.
Over the past six months, two major public frenzies have dominated the field of embodied intelligence. One centered on screens: from Sora to a succession of video generation models showcasing their capabilities—details like water spilling and spreading, human motion through continuous space—propelling the narrative of “AI recreating reality” to its peak, with cries of “The world model has arrived!” echoing everywhere. The other centered on tombstones: NVIDIA’s chief research scientist, Jim Fan, declared “VLA is dead, long live the world model!” with a meme depicting a WAM (World Action Model) standing before a VLA (Vision-Language-Action Model) tombstone, directly bringing the debate over technical pathways to the forefront. (This article discusses only world models in embodied intelligence.)
Both frenzies share the same core keyword: world model.
Yet paradoxically, the more people talk about world models, the more ambiguous the definition becomes: some call generating realistic videos a world model, others refer to robot motion prediction as a world model, and still others label autonomous driving simulation environments as world models. Entirely different technical goals and business objectives are packed under the same term.
The greatest danger to today’s world models is not that they are poorly defined, but that their entire value is being judged by their most visible, most viral aspect. When flashy “world-building” overshadows the core purpose of “using the world,” these models are led away from where they truly belong: real-world physical scenarios for Physical AI.
A world model naturally requires the ability to "create a world." Without those impressive generative demonstrations, it would not have entered the public and capital spotlight so rapidly. But for the Physical AI industry, generating a world has always been just the beginning. The world must ultimately be controlled, verified, and corrected, becoming a simulation space and decision-making basis for machine actions. Video generation can open the door to world models, but it cannot complete the journey toward the real physical world.
We never lack new concepts and new narratives; embodied intelligence will inevitably carve out its own universal path. At that point, whether this path is called VLA, WAM, or something else entirely may no longer matter.
After all, by then it will already be embedded in our lives.
World models are not entirely equivalent to "generating images."
Do you remember Sora?
When OpenAI released Sora, its technical report was titled “Video generation models as world simulators,” presenting video generation models as a viable pathway toward a general-purpose simulator of the physical world. At the time, Sora demonstrated long-form videos with camera movements, local 3D consistency, and sustained object states, allowing the public to sense intuitively for the first time that AI might truly be learning to “build a world.” Compared to text and images, video aligns naturally with human intuition about the “world”: it encompasses time, space, motion, and continuous change, which easily gives rise to the illusion that the model has grasped the laws of physics.
These capabilities are naturally suited for launch presentations and most likely to attract the attention of capital and the media. Over time, “video generation = world model” has become many people’s default entry point for understanding.
This is certainly not a mistake. In digital-native scenarios, video generation pathways are inherently efficient solutions, and numerous unicorn companies have already emerged. Their products can be used in the gaming industry to generate dynamic scenes in real time, reducing art production costs while enhancing player freedom. In high-cost-of-failure fields such as aerospace and advanced manufacturing, they expand testing boundaries and enrich simulation scenarios, offering clear commercial value. The “worlds” generated here are not merely visuals for audiences to observe, but interactive, testable simulated environments.
The real misreading occurs at the intersection of domains, when world models meet embodied intelligence. Many assume that because a model can generate a continuous, realistic digital world, it can therefore understand, predict, and act in the physical world.
Wang Zhongyuan, President of the Beijing Academy of Artificial Intelligence, offered a sharp assessment: the video generation technology currently regarded as representative of world models is essentially nothing more than pixel-level world simulation. “Video generation models can create scenes of a group of pigs flying in the sky alongside airplanes, because their training data includes vast amounts of science fiction movie content— their goal has never been to replicate the laws of the real physical world.”
A classic embodied scenario illustrates the gap well: picking up a cup. The model can generate cups with a consistent appearance from different viewpoints. That is visual consistency, something it learns from video data. But how much friction is there when a hand reaches out to touch the cup? Can the material withstand the grip force? When the cup lands on the table, is it because the model remembers that “cups are usually on tables,” or does it truly understand gravity, normal forces, and contact constraints? Complex mechanical responses, state changes after contact, and the causal constraints imposed by real physical laws cannot be captured by a generated video. When a generated clip of a car sliding sideways is fed into an autonomous driving training pipeline without verification, the real physical world will eventually deliver a harsh reckoning.
In other words, video generation is one manifestation of a world model that has already been deployed in many scenarios, but it is by no means the world model required for embodied intelligence, nor is it the core form within the context of Physical AI. Defining the world model of embodied intelligence through the visual effects of “building a world” is essentially using the standards of the digital world to measure problems in the physical world.
Is VLA dead? World models are not a revolution—they’re a complement.
"VLA is dead; WAM has taken its place" is the most popular narrative within the industry.
Over the past two years, VLA has been the dominant approach in embodied intelligence. It follows the pretraining paradigm of large language models, establishing a "perception-instruction-action" mapping through vast amounts of teleoperation data, enabling robots to move beyond rigid repetitive actions toward understanding natural language and breaking down complex tasks. All major players in the industry have previously built their core technology on VLA.
However, VLA’s limitations are equally clear: it fundamentally relies on imitation learning, which amounts to memorization and mapping, lacking a foundational understanding of physical laws. When faced with new scenarios or objects not present in its training data, its generalization ability rapidly breaks down. Jim Fan’s WAM approach directly addresses this pain point. Its core logic shifts from “semantic understanding” to “physical prediction”: instead of directly outputting actions, it first predicts future world states and then derives action sequences in reverse—essentially allowing the robot to “simulate” the consequences of its actions in its mind before acting, thereby enhancing its adaptability to unfamiliar environments.
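The contrast can be made concrete with a toy sketch. It is not Jim Fan's or NVIDIA's architecture: it assumes a one-dimensional point robot, a hand-written motion rule standing in for a learned world model, and hypothetical names (`vla_policy`, `imagine_future_states`, `inverse_dynamics`, `wam_policy`) chosen only for illustration. The VLA-style policy replays the action of the nearest memorized state, while the WAM-style policy first predicts future states and then backs out the action that reaches the first of them.

```python
# Toy contrast between a VLA-style policy and a WAM-style policy.
# Everything here is an illustrative assumption: a 1-D point robot,
# hand-written dynamics, and hypothetical function names.

DT = 0.1          # seconds per control step (assumed)
MAX_SPEED = 1.0   # velocity limit of the toy robot (assumed)

# "Demonstrations" a VLA-style policy might memorize: state -> action.
# They were all recorded while moving forward toward a goal at 1.0.
DEMOS = {0.0: 1.0, 0.1: 1.0, 0.2: 1.0}


def vla_policy(position: float) -> float:
    """Imitation-style lookup: replay the action of the nearest seen state."""
    nearest = min(DEMOS, key=lambda seen: abs(seen - position))
    return DEMOS[nearest]


def imagine_future_states(position: float, goal: float, horizon: int) -> list[float]:
    """Stand-in for a world model: predict a feasible path of future states."""
    states = []
    for _ in range(horizon):
        # Move toward the goal, but never faster than the speed limit allows.
        step = max(-MAX_SPEED * DT, min(MAX_SPEED * DT, goal - position))
        position = position + step
        states.append(position)
    return states


def inverse_dynamics(current: float, predicted_next: float) -> float:
    """Derive the action (velocity) that moves current -> predicted_next."""
    return (predicted_next - current) / DT


def wam_policy(position: float, goal: float) -> float:
    """Predict future states first, then back out the first action."""
    future = imagine_future_states(position, goal, horizon=5)
    return inverse_dynamics(position, future[0])
```

In a state the demonstrations never covered, such as position 2.0 with the goal at 1.0, `vla_policy` still replays the memorized forward action, while `wam_policy` returns a backward velocity because its predicted states head toward the goal.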
Thus, the "disruption theory" quickly gained traction, with VLA seen as an outdated paradigm and world models hailed as the next-generation solution for embodied intelligence. But in real-world industry practice, things are far more complex than a simple "either/or" dichotomy.
The industry is splitting into two clear paths, driven by distinct technological philosophies and business objectives:
One is the “replacement” camp led by Silicon Valley, represented by NVIDIA and Google DeepMind, which leverages abundant computational power and data reserves to pursue a complete paradigm shift. In Cosmos 3, NVIDIA has integrated language, images, video, and action sequences into a unified Physical AI world model framework, aiming to eliminate the fragmentation between generation, simulation, and action prediction. The Waymo World Model, a collaboration between Waymo and Google DeepMind powered by the Genie 3 model, does more than generate rare scenarios such as unusual weather or animals entering the road: it keeps these scenarios controllable through driving actions, road layouts, and language conditions, enabling tests of how autonomous driving systems respond in counterfactual situations.
This path is the most ambitious and best fits the "revolutionary narrative," but it has an extremely high barrier to entry—it's a game for the biggest players.
The other, prevalent in China, is the “integration” camp. The vast majority of players have chosen not to start from scratch but to treat the world model as a complementary capability of VLA, embedding it within existing architectures. In May 2026, Zhi Square released AlphaBrain, its embodied large model based on VLA. Drawing inspiration from the human brain’s division of labor among the “cerebrum-cerebellum-trunk,” it integrates the world model’s “simulation” capability into the VLA architecture through a “fast-slow system” partnership: the slow system handles environmental situational awareness and high-level behavioral planning, while the fast system manages fine-grained sensing and rapid feedback. Zhi Square’s founder, Guo Yandong, put it bluntly: “The world model and VLA are not in conflict at all; they are simply branches of the same technological pathway. To perform longer-horizon reasoning tasks, you need either a combination of world model and VLA, or a merger of the two.”
Galaxy General has also made significant progress: its LDA-1B model, released in April this year, performs policy learning, physical prediction, and visual perception simultaneously within a unified framework, integrating world models and action models at an industrial-scale, billion-parameter level for the first time. The related work has been accepted by RSS, a premier robotics conference, and both the model weights and the training code have been open-sourced. Rather than fixating on whether to choose a VLA or a world model, the team takes a more pragmatic approach: prediction and execution share a single model, so that each compensates for the other’s weaknesses.
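The fast-slow division of labor can be sketched in a few lines. This is not the AlphaBrain implementation, whose internals the article does not describe; it is a hypothetical one-dimensional loop in which a slow planner (`slow_system`) picks a coarse waypoint once every `SLOW_PERIOD` ticks and a fast proportional controller (`fast_system`) corrects toward that waypoint on every tick. All names, the period, and the gain are assumptions.

```python
# Hypothetical fast-slow control loop; not the AlphaBrain implementation.
SLOW_PERIOD = 10  # the slow system replans once every 10 fast ticks (assumed)


def slow_system(position: float, goal: float) -> float:
    """Slow system: pick a coarse waypoint halfway to the goal."""
    return position + 0.5 * (goal - position)


def fast_system(position: float, waypoint: float, gain: float = 0.5) -> float:
    """Fast system: proportional correction toward the current waypoint."""
    return gain * (waypoint - position)


def run(position: float, goal: float, ticks: int) -> tuple[float, int]:
    """Run the loop and return the final position and the number of replans."""
    replans = 0
    waypoint = position
    for tick in range(ticks):
        if tick % SLOW_PERIOD == 0:
            # Infrequent, expensive step: high-level planning.
            waypoint = slow_system(position, goal)
            replans += 1
        # Frequent, cheap step: rapid feedback toward the waypoint.
        position += fast_system(position, waypoint)
    return position, replans
```

Calling `run(0.0, 1.0, 100)` replans only ten times over a hundred control ticks, which is the point of the split: the costly reasoning runs rarely, while the cheap feedback runs on every tick.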
In our view, “replacement” versus “integration” is not a matter of right or wrong, but rather different choices at different stages. VLA will not truly “die,” and world models are not revolutionary disruptions that overturn everything—they simply fill the critical gap in VLA’s physical prediction capabilities. The ultimate relationship between the two is more likely to be layered collaboration, not zero-sum competition. What truly determines which path succeeds is not how trendy the concept is, but who can first connect the chain of data, simulation, and real-world deployment to bring robots into actual environments.
World models have yet to be deployed at scale, but the hype around the concept has already taken off.
When conceptual hype outpaces technological implementation, bubbles are almost inevitable. In today’s world model race, at least three worrying bubbles have already emerged.
The first layer is the definition bubble. Today, the term “world model” has become a catch-all. Yann LeCun defines it as an abstracted prediction of world states, Fei-Fei Li describes it as an interactive 3D spatial representation, NVIDIA positions it as a physics-based AI generative simulator, some startups use video generation as a placeholder, and others simply rebrand traditional simulation engines as "world models." Dozens of companies in China claim to be working on world models, yet they may not even be referring to the same thing. When a technical concept can be endlessly reinterpreted, it often loses its value as a technical benchmark. The broadening of definitions is driven jointly by funding needs and marketing narratives. After all, calling something a "world model" sounds far more valuable than labeling it a "video generation tool" or a "simulation optimization solution."
The second layer is the compute bubble. The dominant training approach for world models relies on massive video datasets and enormous computational power, which is exactly NVIDIA’s domain. At GTC, Jensen Huang explicitly stated that by 2027, Blackwell and Rubin chips, along with supporting systems designed for embodied AI models, will generate at least $1 trillion in revenue for NVIDIA. In a sense, the push by Silicon Valley’s leading players toward a “multimodal general world model” aligns perfectly with NVIDIA’s business logic of selling compute infrastructure. For the vast majority of companies, however, the cost of entry on this path is close to a bottomless pit. Even smaller teams that previously bet on VLA have struggled to bear such massive sunk costs, let alone enter the world model space from scratch. When everyone is discussing the same high-compute path but few can accurately calculate the return on investment, that itself is a sign of a bubble.
The third and most fatal layer is the deployment bubble. All conceptual narratives must ultimately answer the same question: can it actually improve real-world performance? The reality is that the gap between simulation and reality does not vanish simply because a model’s name changes from VLA to WAM. A minor interpenetration of objects, an anti-gravity effect, or a blurred boundary in a video will become entrenched as incorrect physical understanding in robot training; a prediction that appears plausible but violates physical laws can mislead real robots even more severely than training without any model at all.
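One way to keep such artifacts out of training data is to screen generated trajectories against basic physics before they are used. The sketch below is a deliberately simple assumption, not a method described in the article: it takes height samples of an object in free fall above a table and rejects a trajectory if the object sinks below the surface (penetration) or if its estimated acceleration strays from gravity (anti-gravity). The function name, the tolerance, and the table height are hypothetical.

```python
# Hypothetical sanity filter for generated object trajectories.
# It only applies to a segment in which the object is in free fall.
G = 9.81             # gravitational acceleration, m/s^2
TABLE_HEIGHT = 0.0   # height of the supporting surface, m (assumed)


def is_physically_plausible(heights: list[float], dt: float, tol: float = 1.0) -> bool:
    """Check a free-falling object's height samples (m) taken every dt seconds."""
    # Penetration: the object must never sink below the table surface.
    if any(h < TABLE_HEIGHT for h in heights):
        return False
    # Anti-gravity: the second finite difference estimates vertical
    # acceleration, which should stay within tol of -G during free fall.
    for prev, cur, nxt in zip(heights, heights[1:], heights[2:]):
        accel = (nxt - 2 * cur + prev) / dt**2
        if abs(accel + G) > tol:
            return False
    return True
```

A trajectory that follows `1.0 - 0.5 * G * t**2` passes, while an object that hovers at a constant height or dips below the table is rejected. A production filter would need contact, friction, and sensor-noise handling that this sketch leaves out.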
Shen Yujun, Chief Scientist of Ant Lingbo, highlighted the core difference: in the digital world, generative models can pursue high-definition realism, and a slower speed is acceptable; but for models in the physical world, the primary requirements are speed, stability, and accuracy—they must deliver real-time feedback and support actions. Many teams are fixated on rendering digital scenes with ever-increasing realism, yet overlook the fact that data from real physical interactions is the most scarce resource. A world model may achieve impressive metrics in simulation, but unless it has proven real-world value on factory production lines, logistics warehouses, or open roads, it remains a laboratory research project rather than an industrial-grade infrastructure.
So, what should a world model for Physical AI or embodied intelligence look like? The answer has never been in demo videos at product launches, but in the demands of real-world scenarios. Its core evaluation criterion has never been “how realistic the generated world is,” but rather “whether it helps machines act more effectively in the physical world,” whether it reduces trial-and-error costs, enhances generalization capability, and integrates seamlessly into real business workflows.
Judging by current industry practice, the true leaders are all doing the same thing: shifting world models from “presentation-oriented” to “task-oriented.” In other words, the ultimate form of a world model is not a standalone “product,” but a foundational capability embedded within various physical systems. It resides in the simulation backends of autonomous driving, within robotic motion planning modules, and in predictive systems on factory production lines, quietly performing prediction, trial-and-error, and correction. Most of the time, users are unaware of its presence.
That will be the era of world models, even if it no longer goes by that name.
