Scaling laws are a major reason that building large language models came to resemble engineering rather than alchemy. The work behind them established that test loss falls in a smooth, predictable fashion as parameters, data, and compute increase, and that the three must grow in concert to remain near the efficient frontier. The Chinchilla analysis sharpened this picture: for a given compute budget there is a near-optimal ratio of model size to training tokens, and many large models of the era were substantially oversized and undertrained. Once the return on the next order of magnitude can be estimated in advance, a research program becomes a matter of planning rather than speculation.
Whether comparable curves govern humanoid and embodied learning is a natural question. If they did, the recipe would be straightforward: collect more hours of robot or human demonstration, train a larger policy, and expect competence to emerge on schedule. Part of this intuition does carry over: larger models trained on more data generally help. The property that made language scaling so clean, however, does not transfer as readily, and the reason lies in the structure of the data.
Text on the internet is abundant and diverse almost by accident. Physical-interaction data is neither. It is scarce, expensive to capture, and narrow by default unless breadth is engineered in deliberately. That single difference reshapes the question. This post examines what scaling laws predict for embodied AI, why scaling volume within one narrow source underperforms, and which axis is more worth scaling instead.
What scaling laws actually claim
The headline result is narrow, and stating it precisely is worthwhile. Given a fixed model architecture and a fixed data distribution, test loss follows a power law in model size, dataset size, and compute. Increase any one and loss declines along a smooth curve. Increase them together in balanced proportion and the model remains close to the compute-optimal frontier. This is much of what allows a lab to forecast a model's performance before training it.
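For reference, the two forms in which this result is most often quoted can be sketched as follows. These are the standard parametric shapes rather than anything fitted for this post: N is parameter count, D is the number of training tokens, and N_c, \alpha_N, E, A, B, \alpha, and \beta are constants fitted empirically for a given architecture and data distribution, so no values are given here.

```latex
% Single-variable form: loss as a power law in model size N,
% assuming data and compute are not the limiting factor.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

% Chinchilla-style joint form: an irreducible term E plus
% separate power-law penalties for finite model size and finite data.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The fitted constants belong to one data distribution. Neither expression says anything about loss measured on a different one.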
Two assumptions are embedded in that statement, and both carry considerable weight. The first is a fixed data distribution: the law describes the effect of sampling more from the same source, not what happens when the source is exhausted or when test conditions drift away from training. The second is that loss is the quantity of interest. For a chatbot, next-token loss correlates closely enough with usefulness. For a robot, the relevant metric is whether the policy completes a task in a setting and on a body it has not previously seen. These are precisely the two assumptions that embodied AI strains.
Why physical data differs from internet data
Language models inherited something close to a free lunch. The web already held trillions of tokens spanning thousands of topics, registers, languages, and styles, produced by millions of people over decades. Diversity arrived bundled with volume at negligible marginal cost, and no sentence had to be commissioned. Physical-interaction data offers none of this. Few people have been uploading synchronised first-person video of ordinary tasks with motion attached and rights cleared for commercial training. The asset does not exist until it is captured on purpose.
This changes the economics considerably. With text, the marginal token is nearly free and arrives already diversified. With embodied data, every hour must be sourced, captured on a rig, and cleared for use, and the default hour closely resembles the last: the same operator, the same lab, the same small set of objects. When a team scales embodied data, the easy axis to scale is therefore more hours of the same narrow distribution. That is precisely the move scaling laws are least likely to reward, because it deepens a distribution the policy has already mastered rather than widening the one it will be evaluated against.

Where volume scaling tends to slow
Consider increasing a policy's training data tenfold, all of it the same task, in the same lab, performed by the same person. In-distribution accuracy climbs and the validation curve looks encouraging. The robot then encounters a counter at a different height, a mug with a slicker handle, or a left-handed grasp, and performance degrades. What the additional data has purchased is not generalisation but a more confident specialist in a setting the deployed system may rarely encounter.
This is not a contradiction of scaling laws. It is the fixed-distribution assumption coming due. The law predicts lower loss on samples drawn from the training distribution, and that is what occurs. The difficulty is that the deployment distribution sits elsewhere, and additional in-distribution data does not move the policy toward it. The model is, in effect, optimising a surface on which the test set does not lie.
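The effect can be illustrated with a deliberately simple toy, which is an assumption of this sketch and not a robot-learning experiment. A straight line is fitted to noisy samples of y = x^2 drawn only from [0, 1], then scored both on that interval and on [2, 3]. The function, the intervals, the noise level, and the sample sizes are all hypothetical choices made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse_on_interval(a, b, lo, hi, steps=101):
    """Mean squared error of the line against the noiseless target x**2."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return sum((a * x + b - x * x) ** 2 for x in grid) / steps


def run(n_samples, seed=0):
    """Train on n_samples points from [0, 1]; score in and out of distribution."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_samples)]
    ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]  # assumed noise level
    a, b = fit_line(xs, ys)
    in_dist = mse_on_interval(a, b, 0.0, 1.0)
    out_dist = mse_on_interval(a, b, 2.0, 3.0)
    return in_dist, out_dist


# Same narrow source, ten and a hundred times more data.
results = {n: run(n) for n in (10, 100, 1000)}
for n, (in_dist, out_dist) in results.items():
    print(f"n={n:5d}  in-distribution MSE={in_dist:.4f}  shifted MSE={out_dist:.2f}")
```

In this toy, the error on the shifted interval stays orders of magnitude above the in-distribution error however many samples are drawn from [0, 1]. More samples from the training interval reveal nothing about the region the evaluation comes from.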
More hours of the same task in the same room teach a policy what a particular lab looks like. The same task across many rooms, bodies, and objects teaches it the structure of the task itself.
Motionstack
Stated plainly, scaling volume along a narrow source yields steeply diminishing returns on the metric that matters most: out-of-distribution success. The first hours captured in a new setting are worth far more than the thousandth hour in a familiar one. That curve bears little resemblance to the smooth power law that practitioners hope to inherit from language.
The axis that seems to move generalisation
One of the clearer findings in recent robot-learning research is that broadening the data distribution accomplishes what deepening it cannot. The Open X-Embodiment collaboration pooled demonstrations across many robot platforms and labs, and the RT-X models trained on that mixture transferred more reliably than policies trained on any single narrow dataset. Co-training on a diverse pool allowed skills learned on one embodiment to appear on another. Generalisation, in this case, derived from variety rather than from raw count.
A similar theme runs through the vision-language-action literature. RT-1 demonstrated that scaling the breadth of tasks and scenes, not merely the number of episodes, was much of what produced a policy able to absorb new instructions. RT-2 extended this by grounding action models in web-scale visual and semantic diversity, which improved their handling of objects and instructions absent from the robot data. In both cases the operative lever is coverage of the world rather than depth in one slice of it.
Egocentric human video supports the same conclusion from a different direction. Ego4D gathered thousands of hours of first-person activity from hundreds of camera wearers in dozens of locations worldwide, treating breadth of people and scenes, rather than volume from one place, as the property that makes the corpus useful. Operator-collected robot datasets such as DROID deliberately distribute collection across many labs and scenes, on the premise that in-distribution accuracy is inexpensive while out-of-distribution robustness is the property worth paying for. None of these efforts succeeded by running a single setup for longer.
Which axis to scale, alongside the hours
If the objective is generalisation, the variable most worth scaling is coverage: the number of distinct tasks, people, places, objects, and embodiments the data spans. Hours are a reliable proxy for coverage only when each additional hour contributes something the policy has not seen. Once hours begin to repeat, the proxy breaks down and further spending buys little new capability.
For real-world human task data, the coverage axes that move generalisation are concrete:
- People. Body sizes, handedness, grip strength, pace, and style. A policy trained on a single demonstrator inherits that demonstrator's idiosyncrasies and encodes them as part of the task.
- Places. A galley kitchen in Tokyo, an apartment in Zurich, a flat in São Paulo. Differences in layout, counter height, appliances, clutter, and lighting help the model separate the task from the room.
- Tasks and objects. Not a single canonical demonstration but a fuller distribution of variations and failure modes: a full kettle and a near-empty one, a narrow glass, a slippery handle, an interrupted attempt.
- Embodiments and viewpoints. Cross-embodiment co-training and consistent first-person capture, so that skills transfer across platforms rather than overfit to one rig.
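One way to keep hours from standing in for coverage is to report both side by side. The sketch below is a minimal illustration under stated assumptions: the `Clip` schema, its field names, and the notion of a "novel" clip (one that adds at least one previously unseen value on any axis) are hypothetical, not a description of any particular pipeline.

```python
from collections import namedtuple

# Hypothetical clip metadata; real schemas will carry more fields.
Clip = namedtuple("Clip", "person place task embodiment hours")

AXES = ("person", "place", "task", "embodiment")


def coverage_report(clips):
    """Summarise total hours against distinct values seen on each axis.

    A clip counts as novel if it adds at least one unseen value on any
    axis. The result depends on clip order, so this is a rough
    diagnostic rather than a formal metric.
    """
    seen = {axis: set() for axis in AXES}
    total_hours = 0.0
    novel_hours = 0.0
    for clip in clips:
        total_hours += clip.hours
        is_novel = False
        for axis in AXES:
            value = getattr(clip, axis)
            if value not in seen[axis]:
                seen[axis].add(value)
                is_novel = True
        if is_novel:
            novel_hours += clip.hours
    return {
        "hours": total_hours,
        "novel_hours": novel_hours,
        "distinct": {axis: len(values) for axis, values in seen.items()},
    }


clips = [
    Clip("p1", "lab_a", "pour", "rig1", 5.0),
    Clip("p1", "lab_a", "pour", "rig1", 40.0),   # repeats every axis
    Clip("p2", "kitchen_b", "pour", "rig1", 2.0),
]
report = coverage_report(clips)
print(report)
```

In this example the 40-hour clip dominates the hours total but adds nothing on any axis, while the 2-hour clip adds a new person and a new place.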
A subtler failure mode sits on the far side of this argument. Maximising diversity with no capture standard at all, with arbitrary devices, angles, and framing, produces a dataset that is varied yet difficult to use, because the model cannot reliably separate genuine task variation from capture noise. The discipline is to vary what should vary, namely people, places, and tasks, while holding constant what should not, namely camera placement, field of view, frame rate, and sensor configuration. This is the combination we build for: diverse content captured on a standardised rig so that every clip remains comparable.

Compute, data, and diversity as a budget
The Chinchilla lesson for language was that compute, model size, and tokens must be balanced, and that many labs were misallocating by training oversized models on too little data. Embodied AI faces an analogous budgeting problem with an additional dimension. The trade-off is among compute, total data, and the diversity of that data simultaneously, and the binding constraint is usually diversity, because it is the most expensive to acquire and among the most valuable once held.
A useful framing: compute is inexpensive relative to physical data, and narrow physical data is inexpensive relative to diverse physical data. The optimisation therefore looks less like the language frontier, where data and parameters scale together, and more like a sourcing problem. The highest-leverage expenditure is rarely another hour of an existing setup. It is the first hour of a setup not yet represented in the corpus.
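The sourcing intuition can be sketched as a toy allocation problem. Everything here is an assumption made for illustration: the value of a setup is modelled as a concave function of hours (`log1p`), every hour costs the same, and the setup names are hypothetical. Real programs would need their own value and cost models, and a new setup usually costs more per hour than an existing one.

```python
import math


def marginal_gain(hours_so_far):
    """Assumed value of one more hour in a setup that already has hours_so_far.

    Uses a concave log1p value curve purely as a stand-in for
    diminishing returns; this is not an empirical law.
    """
    return math.log1p(hours_so_far + 1) - math.log1p(hours_so_far)


def allocate(existing_hours, budget_hours):
    """Greedily assign each budgeted hour to the setup with the largest gain."""
    hours = dict(existing_hours)
    for _ in range(budget_hours):
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(hours), key=lambda setup: marginal_gain(hours[setup]))
        hours[best] += 1
    return hours


existing = {"lab_a": 500, "kitchen_b": 0, "flat_c": 0}
plan = allocate(existing, budget_hours=10)
print(plan)
```

Under these assumptions none of the ten new hours goes to the setup that already has 500. Adding a per-setup cost and ranking by gain per unit cost would be a natural extension.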
This reframes data acquisition less as a scraping problem and more as logistics: placing the right person, in the right location, to record the right task, on a consistent rig, cleared for commercial use. Logistics can be engineered and scaled deliberately, which is much of the premise behind how we source data and the solutions we build around it.
What this might mean for budgeting a data program
For teams planning an embodied-AI data spend, the practical implication is to budget for coverage explicitly rather than allowing hours to stand in for it. Several practices keep a program honest:
- Specify the deployment distribution first. Name the people, places, and task variations the policy is likely to encounter, then size collection to span them rather than to reach an hours target.
- Hold out a coverage slice. Reserving a city, a contributor type, or an object set the model never trains on gives a clearer measure of generalisation than an in-distribution split.
- Treat diversity as the scarce input. When the marginal hour begins to repeat existing data, redirect spend toward a new person, place, or embodiment rather than more of the same.
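The held-out coverage slice in the second practice above can be sketched as a group-wise split, in which every record from a reserved group goes to evaluation and none to training. The record layout and the `city` key below are hypothetical; the same function would work for a contributor type or an object set.

```python
def split_by_held_out_group(records, key, held_out):
    """Split records so that no held-out group value appears in training.

    records:  iterable of dicts carrying metadata for each clip
    key:      metadata field that defines the group (for example "city")
    held_out: set of group values reserved for evaluation only
    """
    train, test = [], []
    for record in records:
        (test if record[key] in held_out else train).append(record)
    return train, test


# Hypothetical clip metadata.
records = [
    {"clip_id": 1, "city": "Tokyo", "task": "pour"},
    {"clip_id": 2, "city": "Zurich", "task": "pour"},
    {"clip_id": 3, "city": "Zurich", "task": "wipe"},
    {"clip_id": 4, "city": "Sao Paulo", "task": "pour"},
]
train, test = split_by_held_out_group(records, key="city", held_out={"Zurich"})
print(len(train), "training clips;", len(test), "held-out clips")
```

A random split of the same records would usually place Zurich clips on both sides, which measures in-distribution accuracy rather than generalisation to an unseen city.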
So, do scaling laws hold?
Partly, and with an important qualification. Within a fixed data distribution, larger models and more data do help embodied policies, and balancing compute against data remains the correct instinct. But the forecastable power law that made language scaling feel like engineering does not transfer cleanly, because its underlying assumptions, a free and diverse data supply and a loss metric that tracks usefulness, hold poorly for robots. The binding constraint is less the number of hours a team can afford and more how much of the real world those hours actually cover.
A more accurate scaling law for embodied AI is therefore less about quantity and more about distribution. Widen the distribution and generalisation follows. Deepen a narrow one and the result is a stronger demonstration and a weaker product. Labs that internalise this stop asking how to collect more and start asking how to collect more widely, which is a different and considerably more tractable question.
If you are training a humanoid or manipulation policy and deciding where the next data dollar should go, tell us the spec. We field the right people, in the right places, on a consistent rig, diverse where it counts and consistent where it should be.