Research·May 18, 2026·9 min read

Why diverse home data matters more than volume for manipulation

A model that sees one tidy kitchen learns that one tidy kitchen. Generalisation for manipulation is driven by variety of place, person, and task, not by raw hours alone.

By Motionstack

Coverage over volume

Manipulation policies frequently degrade outside the lab, for a straightforward reason: real environments seldom repeat. A drawer sticks, a counter is cluttered, the lighting shifts, the mug is unfamiliar, and the person is left-handed. A model trained on a single clean demonstration set learns that set rather than the underlying task. Confronted with a kitchen it has never observed, it has little to draw on, and it extrapolates poorly.

A common response is to scale the effort: collect more hours, enlarge the model, and rely on emergent capability. In some regimes this is effective. For many physical-world tasks, however, it falls short, because the scarce resource is not quantity. It is coverage. This post examines the distinction, why it is decisive for manipulation, and what a dataset requires in order to carry a policy beyond the demonstration.

Volume is not coverage

Raw hours are an appealing target, in part because they are easy to count. But ten thousand hours of the same kitchen, filmed the same way by the same person, teaches a narrow distribution very well and almost nothing else. The model becomes confident and wrong simultaneously, a problematic combination for a system that will eventually handle real objects in a person's home.

Coverage, by contrast, is variety along the axes that actually shift the task distribution. For home manipulation, three axes matter together:

  • Place. The same task in a Tokyo galley kitchen, a Zurich apartment, and a São Paulo flat, with their distinct layouts, counter heights, appliance types, clutter, and lighting.
  • Person. Different body sizes, handedness, grip strength, pace, and style. A policy that has seen a single demonstrator inherits that demonstrator's idiosyncrasies as if they were part of the task.
  • Task variation. Not simply 'pour water', but pouring from a full kettle, from a near-empty one, into a narrow glass, with a slippery handle, or mid-conversation.

Moving along these three axes pushes the model to learn what is invariant about the task and to discard what is incidental about the room. This is the central mechanism. Generalisation is not a model property to be unlocked. It is better understood as a property of the training distribution.
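As a rough illustration of the difference between volume and coverage, the sketch below compares two datasets with identical total hours. The `Clip` record, its field names, and the "effective cells" score (two raised to the Shannon entropy of how hours are spread over place/person/task combinations) are assumptions made for this example, not a standard metric or our production tooling.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    # Hypothetical metadata record; field names are illustrative only.
    place: str
    person: str
    task_variant: str
    hours: float


def coverage_report(clips):
    """Summarise volume (hours) and coverage (distinct axis values and cells)."""
    clips = list(clips)
    total_hours = sum(c.hours for c in clips)
    if total_hours <= 0:
        raise ValueError("dataset must contain a positive number of hours")

    # A "cell" is one (place, person, task variant) combination.
    cells = Counter()
    for c in clips:
        cells[(c.place, c.person, c.task_variant)] += c.hours

    # Shannon entropy (bits) of how the hours are distributed over cells.
    entropy = 0.0
    for cell_hours in cells.values():
        p = cell_hours / total_hours
        if p > 0:
            entropy -= p * math.log2(p)

    return {
        "total_hours": total_hours,
        "distinct_places": len({c.place for c in clips}),
        "distinct_people": len({c.person for c in clips}),
        "distinct_task_variants": len({c.task_variant for c in clips}),
        "distinct_cells": len(cells),
        # Equals the cell count when hours are spread evenly; lower when skewed.
        "effective_cells": 2 ** entropy,
    }


narrow = [Clip("kitchen_a", "p1", "full_kettle", 100.0)]
broad = [
    Clip("tokyo", "p1", "full_kettle", 25.0),
    Clip("zurich", "p2", "near_empty", 25.0),
    Clip("sao_paulo", "p3", "narrow_glass", 25.0),
    Clip("tokyo", "p4", "slippery_handle", 25.0),
]
print(coverage_report(narrow))  # 100 hours, effective_cells is 1.0
print(coverage_report(broad))   # 100 hours, effective_cells is 4.0
```

Both datasets report the same volume, but only the second spreads those hours across places, people, and task variants. A score like this says nothing about whether the covered cells are the right ones for a given deployment; it only makes the gap between hours and variety visible.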

The same task, captured across many parts of the world, teaches a policy what the task is. The same task captured once teaches it what your kitchen is.

Motionstack

The evidence from large robot datasets

This is not only intuition. One of the clearer signals in recent robot-learning research is that cross-embodiment, cross-environment data improves generalisation more reliably than additional data of the same kind. The Open X-Embodiment effort pooled data across many robots and labs, and reported that policies co-trained on the diverse mixture often outperformed policies trained on a single narrow dataset. RT-2 demonstrated that building action models on vision-language models pretrained on broad, web-scale data improves their ability to handle novel objects and instructions.

A comparable pattern appears in egocentric video. Ego4D showed the value of thousands of hours of first-person activity recorded by hundreds of participants across dozens of locations in several countries, a categorically different, and frequently more useful, asset than a larger collection from one place. Operator-collected sets such as DROID deliberately distribute collection across labs and scenes, in part because in-distribution accuracy is comparatively cheap while out-of-distribution robustness is the property worth paying for.

The lesson is consistent: broadening the distribution leads policies to memorise less and generalise more. Deepening it, by adding more data of the same narrow kind, mainly yields a stronger demo and a weaker product.

Why this data was rarely available online

If diverse, real, first-person demonstrations are this valuable, why not simply scrape them? Because they were rarely posted. The internet holds a vast quantity of polished how-to videos produced for an audience: edited, third-person, without synchronised motion, without consent for commercial AI training, and skewed toward the narrow set of tasks people make content about. What a manipulation policy requires is different: unedited first-person captures of ordinary people performing ordinary chores, with the motion attached and the rights cleared. That material was never on the web to begin with.

This reframes the bottleneck from a scraping problem into a sourcing problem: placing the right person, in the right environment, to record the right task, on a consistent rig, cleared for commercial use. It is a matter of logistics rather than luck, and logistics can be engineered at scale.

[Figure: isometric cutaway of a kitchen with a person performing a cooking task.]
Caption: The same task, captured across kitchens, bathrooms, and living rooms in dozens of cities, is what helps a policy learn the task rather than the room.

Diverse content, consistent capture

There is a genuine tension on the other side of this argument. Maximising diversity with no standard at all, using arbitrary phones, angles, and framing, produces a dataset that is varied but difficult to use, because the model cannot separate task variation from capture noise. The discipline is to vary what should vary (place, person, task) while holding constant what should not (camera placement, field of view, frame rate, and sensor configuration).
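One way to enforce the "hold constant" half of this discipline is a simple check over clip metadata that flags any capture field taking more than one value. This is a minimal sketch: the field names (`camera_mount`, `fov_deg`, `fps`, `sensors`) and the dictionary layout are assumptions for illustration, not a description of any particular rig or pipeline.

```python
# Hypothetical capture fields that should be identical across every clip.
CAPTURE_FIELDS = ("camera_mount", "fov_deg", "fps", "sensors")


def capture_inconsistencies(clips, fields=CAPTURE_FIELDS):
    """Return {field: distinct values} for each capture field that varies.

    `clips` is an iterable of dicts; values must be hashable
    (for example, a tuple of sensor names rather than a list).
    """
    seen = {name: set() for name in fields}
    for clip in clips:
        for name in fields:
            seen[name].add(clip[name])
    # Sort by repr so mixed value types can still be ordered deterministically.
    return {
        name: sorted(values, key=repr)
        for name, values in seen.items()
        if len(values) > 1
    }


clips = [
    {"camera_mount": "head", "fov_deg": 120, "fps": 30, "sensors": ("rgb", "imu")},
    {"camera_mount": "head", "fov_deg": 120, "fps": 60, "sensors": ("rgb", "imu")},
]
print(capture_inconsistencies(clips))  # {'fps': [30, 60]}
```

An empty result means the capture side is consistent, so any remaining variation in the clips comes from place, person, and task, which is the variation the model is supposed to learn from.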

This is the combination we build for: broad diversity in content, captured on a standardised rig so that every clip remains comparable. Diverse content with consistent capture is what makes a dataset usable as a whole, because clips from different homes can be pooled and compared directly. It also makes evaluation more rigorous: testing on a held-out city or person type measures whether the policy generalises, rather than merely confirming performance on an in-distribution test split.
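A held-out slice of this kind amounts to splitting by group rather than by clip, so that no clip from a held-out city (or contributor) leaks into training. The sketch below assumes clips are plain dictionaries with a `city` key; the function name and data layout are hypothetical.

```python
def split_by_group(clips, key, held_out_values):
    """Split clips so every clip whose `key` is held out lands in the test set.

    Unlike a random per-clip split, no held-out city or person
    contributes any clip to the training set.
    """
    clips = list(clips)
    held_out_values = set(held_out_values)

    # Fail loudly if a requested held-out group has no data at all.
    missing = held_out_values - {clip[key] for clip in clips}
    if missing:
        raise ValueError(f"held-out values not found in data: {sorted(missing)}")

    train = [clip for clip in clips if clip[key] not in held_out_values]
    held_out = [clip for clip in clips if clip[key] in held_out_values]
    return train, held_out


clips = [
    {"id": 1, "city": "tokyo"},
    {"id": 2, "city": "zurich"},
    {"id": 3, "city": "sao_paulo"},
    {"id": 4, "city": "zurich"},
]
train, held_out = split_by_group(clips, key="city", held_out_values={"zurich"})
print([c["id"] for c in train])     # [1, 3]
print([c["id"] for c in held_out])  # [2, 4]
```

The same function works for a held-out person type by passing a different `key`. Success on the held-out set then reflects transfer to an unseen group rather than recall of a familiar one.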

A practical rule of thumb

When validation loss is strong but the robot fails the moment it leaves the lab, the answer is rarely a larger model. More often it is more varied, real, consented demonstration data, paired with a held-out slice diverse enough to surface the failure before the customer does.

  • 3 axes: place, person, and task variation, varied together
  • 1 rig: standardised capture so clips stay comparable
  • 50+ countries in which we field contributors

What to specify when you commission data

When scoping a manipulation dataset, specify coverage explicitly rather than volume alone. A sound specification names:

  1. The task, in concrete terms, including the variations and failure modes that matter.
  2. The person distribution: how many distinct contributors, and across what demographics and body types.
  3. The place distribution: how many distinct homes, in which cities, and with what layout variety.
  4. The capture standard: framing, field of view, frame rate, and which modalities (video, motion, depth) require synchronisation.
  5. The held-out slice for evaluation, chosen to test generalisation rather than memorisation.

Putting coverage in the specification changes the definition of 'done'. The deliverable shifts from 'X hours' toward 'X hours that span the distribution the policy will actually encounter.' This is the dataset that holds up on contact with a real home.
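The five items above can be written down as a machine-checkable specification, so that 'done' is tested against coverage and not only against hours. The following is a minimal sketch: the class names, field names, and the clip dictionary keys are all hypothetical, and it checks only a subset of what a real specification would cover (demographics, layout variety, and the capture standard itself are recorded or omitted here, not verified).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureStandard:
    # Item 4: recorded here for completeness; not verified by the check below.
    framing: str
    fov_deg: float
    fps: int
    synced_modalities: tuple


@dataclass(frozen=True)
class DatasetSpec:
    task: str                 # item 1: the task in concrete terms
    task_variations: tuple    # item 1: variations that must appear
    min_contributors: int     # item 2: distinct people
    min_homes: int            # item 3: distinct homes
    cities: tuple             # item 3: cities that must appear
    capture: CaptureStandard  # item 4: capture standard
    held_out_cities: tuple    # item 5: slice reserved for evaluation
    min_hours: float          # volume still matters, but not alone


def unmet_requirements(spec, clips):
    """Return the names of requirements a delivery fails to meet.

    `clips` is an iterable of dicts with the keys
    contributor, home, city, task_variant, and hours.
    """
    clips = list(clips)
    cities = {c["city"] for c in clips}
    problems = []
    if sum(c["hours"] for c in clips) < spec.min_hours:
        problems.append("hours")
    if len({c["contributor"] for c in clips}) < spec.min_contributors:
        problems.append("contributors")
    if len({c["home"] for c in clips}) < spec.min_homes:
        problems.append("homes")
    if set(spec.cities) - cities:
        problems.append("cities")
    if set(spec.task_variations) - {c["task_variant"] for c in clips}:
        problems.append("task_variations")
    if set(spec.held_out_cities) - cities:
        problems.append("held_out_cities")
    return problems


spec = DatasetSpec(
    task="pour water",
    task_variations=("full_kettle", "near_empty"),
    min_contributors=2,
    min_homes=2,
    cities=("tokyo", "zurich"),
    capture=CaptureStandard("head-mounted", 120.0, 30, ("video", "motion")),
    held_out_cities=("zurich",),
    min_hours=10.0,
)
delivery = [
    {"contributor": "p1", "home": "h1", "city": "tokyo",
     "task_variant": "full_kettle", "hours": 20.0},
]
print(unmet_requirements(spec, delivery))
```

In this example the delivery meets the hours target twice over and still fails every coverage requirement, which is the situation a volume-only specification cannot detect.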

If you are training a manipulation policy and performance falls away outside the lab, tell us the spec. We field the people, in the places you need, on a consistent rig: diverse where it counts, and consistent where it should be.

Get the real-world data your robot needs.

Tell us the task, the person, and the place. We field it from a network of 800k contributors and deliver it to spec, cleared for commercial training, in about four weeks.