A world model is a learned predictive model of how an environment evolves over time. Given the current state and an action, it predicts the next state. Applied recursively, this produces a rollout: a system can imagine a sequence of actions, estimate the consequences of each, and select the sequence that best advances its objective. That is the core mechanism. The substantive questions concern everything beneath it: how the state is represented, how the prediction is conditioned, and what the model must observe in order to predict well.
The term is used loosely, so precision is worth establishing. A world model is action-conditioned: its predictions depend not only on the current observation but on the action the agent takes next. It is generative: it produces plausible future observations rather than merely classifying the present. And it is typically compressed, operating over a learned latent state rather than raw pixels, because reconstructing every pixel is wasteful and predicting the task-relevant structure of a scene is the actual objective.
This article explains where world models originate, what they are used for, and why models intended for the physical world require diverse, real, motion-synchronised human activity video, the kind of data that was largely never recorded for the web in the first place.
A brief lineage
The modern framing traces to Ha and Schmidhuber's 2018 paper World Models, which made the architecture explicit. They decomposed an agent into a vision module that compresses observations into a small latent code, a memory module that predicts how that latent evolves over time, and a compact controller that acts on top of it. Their central result was that an agent could learn a competent policy almost entirely inside its own learned model of the environment and then transfer that policy back to the real task. The model functioned as the simulator.
The line continues through latent-dynamics research, most prominently the Dreamer family. DreamerV3 learns a world model from experience and trains its behaviour by imagining trajectories in the model's compact latent space rather than acting in the real environment for every gradient step. This is the central efficiency argument for world models: rollouts inside a learned model are cheap, so an agent can plan and learn from far fewer real interactions than model-free methods require.
A parallel line of work is action-conditioned video prediction: models that take a frame and an action and predict the next frame directly in pixel or token space. The recent generation of large video and world models extends this approach, learning rich dynamics from large corpora of footage and, in the controllable variants, allowing the generated future to be steered by actions. The implementations differ; the underlying premise is shared. A model that can predict what happens next has learned something durable about how the environment behaves.
Several recurring design choices distinguish a world model from a video generator. The first is the latent state: the model predicts the evolution of a compressed representation that captures the scene structure relevant to acting, rather than reconstructing every pixel. The second is the rollout: a world model is trained so that its predictions can be chained, each output feeding the next input, which is what makes planning and imagination possible. The third is the conditioning signal, which in the embodied case is an action. Remove any of these and the result is something else, a classifier, a one-step predictor, or a passive video model that does not answer the question an agent cares about. That question is not what will the world do, but what will the world do if I take this action.
A world model treats prediction as a form of understanding: a system that reliably forecasts how an environment responds to action has internalised the dynamics of the data it was trained on, and, critically, only those dynamics.
Motionstack
What world models are used for
A world model is rarely an end in itself. It is an instrument, and it earns its place in a few distinct ways. Each application shapes what is required of the data underneath it.
- Model-based reinforcement learning. Rather than learning a policy purely by trial and error in the real environment, the agent learns a model of the environment and trains the policy against that model. This is the Dreamer recipe, and its principal advantage is sample efficiency. Real interactions are slow and expensive, particularly on hardware, while imagined ones are cheap.
- Planning. Given a model that predicts the outcomes of actions, an agent can search over candidate action sequences and select the one with the most promising predicted result. The model becomes a fast, queryable approximation of consequences.
- Simulation and data generation. A capable world model serves as a learned simulator. It can roll out scenarios that would be difficult, slow, or unsafe to stage physically, and supply synthetic experience for downstream training.
- Evaluation. Rolling a policy forward inside a world model is an inexpensive way to stress-test it before deployment, surfacing failure modes that a static test set would miss.
A single principle runs through all four. The world model stands in for reality so that the expensive elements, real interaction, real risk, and real time, can be reduced. That substitution is only as reliable as the model is accurate. A world model that has not observed a given class of situation does not decline to predict it. It predicts confidently and incorrectly, and the systems downstream inherit the error.

Why physical-world models are data-hungry in a specific way
Most large models are data-hungry. World models for the physical world are hungry in a particular and demanding way, which follows from a single constraint: to predict real dynamics, a model must have observed real dynamics. Not described, illustrated, or approximated, but observed in motion, at the resolution the prediction requires.
Consider what the model is being asked to learn. How a half-full kettle tips, and the moment the water begins to pour. How cloth bunches against how a rigid box slides. How a cabinet door swings, and where a hand must be to catch it. How contact changes a system in the instant two surfaces meet. None of this can be read off a single image. It lives in the transitions between frames, in the timing, in the way force translates into motion. A model can learn a transition only from examples of that transition.
It must also observe those transitions in diversity, or it learns the wrong invariants. A world model trained on one kitchen, one person, and one lighting condition predicts that kitchen's dynamics accurately and generalises poorly beyond it. The training distribution becomes the distribution the model treats as the world. Coverage across places, people, objects, and task variations is not an enhancement here; it is the difference between a model of the world and a model of one room. We have made the broader case for coverage elsewhere, and it applies with particular force to world models because their errors compound. A wrong one-step prediction becomes a far wronger ten-step rollout.
Action-conditioning needs the action
This is where existing footage typically falls short. An action-conditioned world model must learn the mapping from action to consequence, which means the training data cannot be video alone. It must be video with the action attached: the motion that produced what appears on screen, synchronised frame for frame with the visual stream.
Watch someone pour a glass of water in an ordinary clip and you see the result of the motion, but not the motion itself, not the trajectory of the wrist, the grip, the speed, or the correction made halfway through. Without that signal a model can learn to generate plausible futures, but it cannot reliably learn what a given action would do, because it never observed which actions produced which outcomes. Action-conditioning is the component of the problem that requires synchronised motion data, and it is precisely the component absent from most footage already recorded.
Synchronisation is not a detail to be reconciled afterward. If the motion track drifts even a few frames out of alignment with the video, the model learns a corrupted mapping, binding an action to a consequence the action did not cause. At rollout time those misattributions accumulate, and the predictions drift in ways that are difficult to detect. The same holds for motion granularity. A coarse body pose is sufficient to learn that someone reached across a table; it is insufficient to learn the fine hand dynamics that determine whether a grasp holds or slips. The fidelity a model can reach is bounded by the fidelity of the action signal it was trained on, which is why the capture rig and the synchronisation pipeline matter as much as the volume of footage.

Why this data was rarely on the web
If diverse, action-conditioned, first-person activity data is what physical-world models require, the obvious question is why it cannot simply be scraped. The answer is that it was, for the most part, never created. The web holds an enormous quantity of video, but little of it is the right shape for this purpose.
Most online video is third-person and edited, shot for an audience rather than for a model. It carries no synchronised motion track, so the action that produced each frame is gone. It skews toward the activities people make content about and away from the ordinary, repetitive physical tasks that fill much of a real day. And it carries no consent for commercial AI training. Foundational egocentric efforts such as Ego4D demonstrated how valuable large-scale first-person footage can be, and, by being a deliberate research collection rather than a scrape, also established that this kind of data must be commissioned rather than found.
The robotics side shows the same pattern. Open X-Embodiment pooled action-labelled trajectories across many robots and laboratories precisely because paired action-and-outcome data does not exist at scale in the wild; it must be collected deliberately. World models for the physical world sit downstream of the same gap. The footage that would teach them real dynamics, in diversity, with the motion attached, was simply never part of the internet's training set.
What to look for in world-model data
For anyone building or training a world model for embodied tasks, the data specification matters at least as much as the architecture, since the architecture can only learn what the data contains. An effective specification names a few things explicitly:
- First-person, egocentric video that captures activity from the actor's point of view, where contact and consequence occur.
- Synchronised motion, with hand pose and camera ego-motion aligned frame for frame with the video, so the model learns action-conditioned transitions rather than appearance alone.
- Coverage across places, people, objects, and task variations, so the model learns what is invariant about the dynamics rather than what is incidental to one setting.
- A consistent capture standard (framing, field of view, frame rate, camera configuration), so the model can distinguish task variation from capture noise.
- Cleared commercial rights, so the data you train on is data you can ship.
Meet that specification and the model can learn the dynamics rather than a caricature of them. Fail it and the result is a fast, confident predictor of a world that does not quite exist, which is worse than no model at all because it fails quietly. Our end-to-end approach is built around these requirements: diverse, real, first-person capture with motion synchronised and rights cleared, on a consistent rig.
If you are training a world model for the physical world and lack the right kind of data, diverse, real, first-person, with motion synchronised and rights cleared, tell us the spec. We can field the people, the places, and the tasks, captured the way a world model needs to observe them.