On June 3, 2026, the World Labs team, in collaboration with Stanford University professor Fei-Fei Li, published a conceptual analysis paper with a straightforward title: “A Functional Taxonomy of World Models.” The paper opens by directly challenging an unspoken industry convention: “World models are among the most important and most misused terms in artificial intelligence today.”
The context of this statement is familiar to anyone who has followed the AI industry.
In February 2024, OpenAI released the video generation model Sora, with its technical report boldly titled “Video Generation Models as World Simulators.” At the time, Jim Fan, Director of Robotics at NVIDIA, left a comment on LinkedIn that was later widely cited, describing Sora as essentially a world model whose only permitted action is to take no action. Meanwhile, according to public reports, Tesla’s AI team has repeatedly referred to the prediction component within its Full Self-Driving (FSD) system as a “world model” or “world simulator.” Game engines, 3D generation tools, embodied intelligence models: all manner of products and technologies are being lumped together under the same label.
A video generator, an autonomous driving prediction network, a robot control model, and a physics engine—what do they have in common? Almost nothing. Yet all of them are called “world models.”
After more than two years of conceptual confusion, someone has finally attempted to clarify it systematically. Fei-Fei Li's team did not release a new model, announce a new benchmark, or demonstrate any product features. Instead, they did something more fundamental: they returned to the theoretical origin, the partially observable Markov decision process (POMDP), and reduced all systems currently labeled as “world models” to three distinct functional projections of a single cognitive loop.
The three projections are: renderer, simulator, and planner. Under World Labs' classification framework, Sora and similar video generation models fall into the category of renderer.
How can one term encompass so many contradictory meanings?
To understand the root of this chaos, we first need to ask a more fundamental question: When a company says, “We’re building a world model,” what exactly does it mean?
For OpenAI, Sora’s goal is to “understand and represent the physical world in video.” According to the technical report, Sora learns statistical patterns from vast amounts of video data, enabling it to generate visually coherent scenes—such as a cup shattering when it hits the ground, a paper airplane flying when released, and a person’s legs alternating as they walk. These scenes appear to “understand physics.”
For Tesla, the "world model" is a neural network within the FSD system that predicts the future trajectories of road participants over the next several seconds. It must output precise 3D positions, velocities, and orientations to enable the path planning module to compute safe driving decisions. This model does not output pixels; instead, it outputs vectors and probability distributions.
For robotics companies, a "world model" is an internal simulation mechanism that enables robotic arms to predict, "If I push this cup 5 centimeters to the left, will it tip over?" It requires understanding object properties, contact mechanics, and stability, and outputs an assessment of action feasibility.
The goals of these three types of companies are entirely different. Video generation companies care about pixel fidelity, autonomous driving companies care about the accuracy of physical state prediction, and robotics companies care about the predictability of action outcomes. They are all building “world models,” but they are not doing the same thing at all.
World Labs directly identifies the core issue in the article: these systems are all given the same name because they indeed each capture one aspect of "understanding the world." However, each only fulfills one step in the complete cognitive cycle, yet has been packaged by marketing language, media coverage, and capital narratives as a complete world model.
Another driver of conceptual confusion is the inherent appeal of the term itself. “World model” carries a grand narrative quality: it sounds more imaginative, and is better suited to supporting high valuations and fundraising stories, than phrases like “video generation model” or “video prediction model.” When technical capabilities fail to match the expectations the name creates, the concept is inevitably reduced to a marketing tool.
Back to a theory from the 1960s: what would a complete "world model" look like?
World Labs' classification framework is built on a decades-old theory: the partially observable Markov decision process.
This framework describes a complete cycle of interaction between an agent and its environment. The agent, in a given environmental state, performs an action that alters the state of the environment. The agent then receives partial observations through its sensors, which trigger an update to its internal state. This updated cognition drives the next action, and the cycle repeats.
Within this framework, the full functionality of a "world model" should encompass three components: generating observations from states (pixels, point clouds, etc., seen by the human eye or captured by sensors), predicting the next state from actions and the current state (forecasting physical changes), and generating actions from observations and goals (decision-making and planning).
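The three components can be made concrete with a toy example. The following is a minimal sketch, not code from the World Labs paper: the one-dimensional "cup on a table" world, the names render, simulate, plan, and run_loop, and the simple dynamics are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical 1-D toy world: a cup sliding on a table.
# All names and dynamics are illustrative assumptions.


@dataclass(frozen=True)
class State:
    position: float  # true cup position in metres (hidden from the agent)
    velocity: float  # true cup velocity in m/s (never observed directly)


Observation = float  # what the sensor reports: position only, rounded
Action = float       # push strength applied for one time step


def render(state: State) -> Observation:
    """State -> observation. Velocity is dropped, so the view is partial."""
    return round(state.position, 2)


def simulate(state: State, action: Action, dt: float = 0.1) -> State:
    """(state, action) -> next state, using a simple Euler update."""
    velocity = state.velocity + action * dt
    return State(position=state.position + velocity * dt, velocity=velocity)


def plan(observation: Observation, goal: float) -> Action:
    """(observation, goal) -> action: push proportionally toward the goal."""
    return 2.0 * (goal - observation)


def run_loop(state: State, goal: float, steps: int) -> list[Observation]:
    """Chain the three arcs into one closed loop and record the observations."""
    seen = []
    for _ in range(steps):
        obs = render(state)               # observation generation
        action = plan(obs, goal)          # decision-making and planning
        state = simulate(state, action)   # state transition
        seen.append(obs)
    return seen
```

Calling run_loop(State(0.0, 0.0), goal=1.0, steps=3) chains all three functions on every step. Each function on its own covers only one arc of the cycle; the loop closes only when all three are present.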
Language models learn the statistical patterns of text sequences, while world models learn the statistical properties of space and time—such as how light reflects off different surfaces, how objects move under gravity, and how energy is transferred after rigid-body collisions.
The World Labs team points out that every system currently labeled a "world model" implements only one functional component of the complete cycle above. Some systems only perform rendering from state to observation, others only perform state prediction from action to next state, and still others only perform planning from observation to action. Each captures just one arc of the cycle, yet all carry a label that implies the full circle.
The value of this analytical framework lies in providing a comparative coordinate system that goes beyond marketing rhetoric. No matter how a company packages its product, placing it back into the POMDP cycle and examining its inputs, outputs, and missing components will reveal its true capabilities.
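This kind of audit can be sketched as a small lookup over input and output types. The snippet below is an illustrative assumption rather than a tool from the paper: the Role enum, the SIGNATURES table, the string labels, and the audit function are hypothetical names that encode the three-way split described above.

```python
from enum import Enum


class Role(Enum):
    RENDERER = "renderer"    # state -> observation
    SIMULATOR = "simulator"  # (state, action) -> next state
    PLANNER = "planner"      # (observation, goal) -> action


# Required inputs and produced output for each arc of the loop (assumed labels).
SIGNATURES = {
    Role.RENDERER: ({"state"}, "observation"),
    Role.SIMULATOR: ({"state", "action"}, "state"),
    Role.PLANNER: ({"observation", "goal"}, "action"),
}


def audit(inputs: set[str], outputs: set[str]) -> dict[str, list[str]]:
    """Report which arcs a system covers and which ones it is missing."""
    covered = [
        role.value
        for role, (needed, produced) in SIGNATURES.items()
        if needed <= inputs and produced in outputs
    ]
    missing = [role.value for role in Role if role.value not in covered]
    return {"covered": covered, "missing": missing}
```

For a system that accepts only a state description and emits only frames, audit({"state"}, {"observation"}) reports the renderer arc as covered and the simulator and planner arcs as missing, regardless of what the product is called.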
The boundaries of capability for renderers, simulators, and planners
In World Labs' taxonomy, the first category is defined as "renderers." Its core objective is to generate high-fidelity pixel outputs tailored for human visual perception. The input is a representation of some environmental state (which can be a text description, 3D scene parameters, or implicit encoding), and the output is a sequence of continuous frames.
The renderer's optimization focuses on visual realism rather than physical accuracy. World Labs' article explicitly states that the rendered buildings may appear "unstable" because the system does not actually solve structural mechanics equations; the splashes of liquid may look realistic, but the volume, flow rate, and impact force of the liquid may bear no relation to real-world physical quantities. Therefore, such models cannot be used for architectural design, robot training, or any task requiring physically accurate simulations.
Google's Genie 3, various text-to-video models, and nearly all AI video generation tools fall into this category. Sora is included as well.
The second type is a "simulator." Its primary goal is not to generate visuals for human viewing, but to produce precise states suitable for subsequent computations. The input consists of the current environmental state and external forces (or actions), and the output is the next state, faithfully adhering to the laws of physics and geometry. The state generated by the simulator can be used for stress analysis, energy consumption calculations, and collision detection, and may also serve as input to a renderer to produce visualizations. However, its core value lies in the computability of the state itself.
NVIDIA Omniverse is a representative example of such systems. It is not an AI-native model, but rather a digital twin platform that integrates traditional physics engines with AI-accelerated computing. According to World Labs in their article, simulators serve as a bridge between rendering and planning, but the scarcity of high-quality 3D physics-annotated data remains a major bottleneck. World Labs estimates in their article that the data available for training such models is several orders of magnitude smaller than the video data available on the internet.
The third category is the "planner." It takes as input observational data (such as camera footage, LiDAR point clouds, and tactile sensor readings) and goal instructions, and outputs the next action to execute. Both VLA (Vision-Language-Action) models and World Action Models fall into this category.
The differences among the three categories are not minor technical variations but fundamental functional distinctions. Renderers output pixels for humans to see, simulators output states for machines to compute, and planners output actions for executors to carry out. A system can possess multiple capabilities simultaneously, but when most systems referred to as "world models" essentially only perform rendering, equating "rendering" with "understanding the world" is a serious cognitive mismatch.
A two-year debate: Is Sora a world model?
In February 2024, OpenAI released Sora, with the technical report titled directly: “Video Generation Models as World Simulators.” This terminology immediately sparked intense debate within the academic community and the developer ecosystem.
Supporters argue that the videos generated by Sora demonstrate 3D spatial consistency, object permanence, and an intuitive understanding of physical interactions. A bitten burger leaves tooth marks, and a dog running through snow kicks up snowflakes—these details suggest the model has learned certain physical laws.
The core argument of critics stems from the classical definition of a world model in reinforcement learning: a world model must be capable of predicting state transitions based on actions. That is, given a current state and an action input, the model should output the subsequent state following that action. Sora cannot do this. Users cannot instruct Sora to “push the cup from the left,” then observe whether the cup falls, in which direction it topples, or where the shards fly.
Jim Fan’s comment precisely captures this contradiction: “Sora is essentially a world model, except that it only allows no-op as the sole action.” This means that Sora does predict how the environment evolves over time, but this evolution occurs without any external intervention—it can only unfold along the causal chains inherently present in the video data. It is not performing interactive reasoning, but rather continuing a sequence of passive observations.
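The distinction can be illustrated with a toy model. This is a minimal sketch under assumed dynamics, not a description of how Sora works: the functions transition, passive_predict, and falls_off, the NO_OP constant, and the numbers are all hypothetical.

```python
NO_OP = 0.0  # hypothetical encoding of "apply no push"


def transition(position: float, push: float) -> float:
    """Action-conditioned model: next cup position given a push (toy dynamics)."""
    drift = 0.01  # assumed: the cup creeps slightly on its own each step
    return position + drift + push


def passive_predict(position: float) -> float:
    """Passive predictor: same dynamics, but the action is pinned to NO_OP."""
    return transition(position, NO_OP)


def falls_off(position: float, table_edge: float = 1.0) -> bool:
    """True if the cup has moved past the edge of the table."""
    return position > table_edge
```

With the cup at 0.95, passive_predict(0.95) keeps it on the table, while transition(0.95, 0.10) sends it over the edge. The passive predictor has no argument through which the question "what if I push?" can even be asked, which is the gap critics point to.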
On Reddit’s r/MachineLearning subreddit, several reinforcement learning researchers expressed sharper criticism: a system that cannot predict state transitions based on actions cannot be called a world model—it can only be called a video prediction model.
World Labs' classification framework provides a definitive answer to this debate. In the POMDP loop, actions are the critical inputs that drive state transitions; a system lacking this input is merely a projection of the "observation generation" component within the full cognitive loop. Sora is a renderer, not a complete world model, and certainly not a world simulator.
But this doesn’t mean Sora has no value. Renderers address a different problem: how to generate images that align with human visual expectations. This challenge is inherently extremely difficult and carries significant commercial value. The issue lies in packaging rendering capability as “understanding the world,” which can mislead technical decision-makers and investors into believing these models already possess physical reasoning or embodied interaction capabilities.
Industry value of conceptual clarification
Clarifying the boundaries of the definition of "world models" is not an academic exercise in semantics—it directly impacts technology selection, investment decisions, and public understanding of AI capabilities.
For a manufacturing company evaluating whether to use a certain “world model” for robot training, determining whether the model is a renderer, a simulator, or a planner is essential to avoid millions of dollars in trial-and-error costs. A model that can only generate video visuals, no matter how realistic, cannot replace precise calculations of forces acting on objects, motion trajectories, and collision outcomes.
For investment institutions, distinguishing among the three projections makes it easier to identify where a project sits in the technology stack. A startup that claims to be building a “world model” but whose product is essentially a renderer competes with video generation companies, not with digital twin platforms or robot control models. This directly determines how the market size is estimated and which peer companies are selected for comparison.
For academia, clear categorization is a prerequisite for establishing comparable benchmarks. If the term "world model" continues to be stretched, researchers will struggle to define what counts as an incremental improvement versus a breakthrough, and peer review will rest on ambiguous terms.
World Labs also points out in the article that clarifying concepts is not intended to create opposition. The future direction will be the integration of these three types of projections. A model that truly understands the physical properties of a cup should be able to render its visual appearance, simulate the physical process when it is knocked over, and plan how a robotic arm can grasp it stably. However, before technology reaches that stage, recognizing the boundaries of each approach is more practical than imagining their fusion.
According to World Labs' article, simulators and digital twin technologies, exemplified by NVIDIA Omniverse, target potential markets exceeding $1 trillion in sectors such as factories, warehouses, and supply chains. This figure comes from vendors' own assessments; whether the market can truly reach this scale depends on whether simulators can overcome the bottleneck of scarce high-quality 3D physical data.
For the current stage of the AI industry, the most important insight may be surprisingly simple: the ability to generate realistic videos does not equate to understanding the physical world; being called a world model does not mean it truly simulates the world. Cutting through marketing language and examining what inputs a system receives, what outputs it produces, and which components are missing within a POMDP loop is the most honest way to assess the boundaries of its technical capabilities.
