Engineering·March 22, 2026·10 min read

Standardising the capture rig: why consistent sensors make robot datasets usable

Diversity is what you place in front of the camera; standardisation is the camera itself. When capture is held constant, clips become comparable, evaluable, and learnable.

By Motionstack

One rig, every clip

A real-world data programme has two controls, and they pull in opposite directions. The first is content: who films, where, and performing which task. This should be as varied as the world the robot will eventually operate in. The second is capture: the camera, the lens, the frame rate, the clock, the motion sensors. This should be the opposite, as steady and identical as the hardware allows. The discipline of the work lies in holding the first control wide open while keeping the second fixed.

Many programmes handle the first half well and let the second drift. They issue a brief, ask contributors to film on whatever device they own, and accumulate a body of clips that is genuinely varied yet difficult to use. The diversity is real. The problem is that the model cannot distinguish which differences belong to the task and which belong to the equipment. This piece concerns the second control: what a standardised rig is, why it is the precondition for everything else, and how it determines whether a dataset trains a policy or confuses one.

Treat it as the companion to the case for data diversity. Vary the content; hold the capture constant. Both conditions must be met for either to be worth much.

What a standardised rig actually means

A rig is more than a camera. It is the full chain that converts a person performing a task into data a model can read, and standardising it means fixing the parameters that would otherwise drift between contributors and between cities. In practice this requires a defined and documented set of decisions:

  • Camera placement. Head-mounted at a fixed offset and orientation, so the egocentric viewpoint retains the same geometry across captures. The camera sits at eye position rather than wherever the strap settles.
  • Field of view and lens. A single optical formula across the fleet, so a doorway spans a comparable pixel width in Lagos and in Lyon. A different field of view is effectively a different dataset under the same label.
  • Resolution and frame rate. A fixed capture resolution and a fixed frame rate, since motion that resolves clearly at 60 fps degrades at 24 fps.
  • Exposure handling. A defined, constrained auto-exposure policy, so that a bright window does not re-grade half the clip and imply to the model that kitchens are darker than bathrooms.
  • Calibrated optics for derived motion. Known intrinsics across the fleet, so motion (camera ego-motion and 3D hand pose) is recovered from the video itself and aligns with the pixels frame for frame, with no separate sensor or clock to drift.

No individual choice here is exotic. The discipline is in making each choice once and holding it constant across thousands of captures and dozens of countries. That consistency is much of what the product is.

Why ad hoc capture produces varied but unusable data

Consider the uncontrolled approach. A programme writes a careful brief, recruits a diverse set of contributors, and lets each film on their own device. The returned clips vary along nearly every axis simultaneously. Person and place vary, as intended. So do the lens, the field of view, the frame rate, the auto-exposure behaviour, the rolling-shutter artefacts, the white balance, the codec, and the clock. There is typically no motion track, and little to synchronise it against if there were.

The model then faces an intractable attribution problem. When two clips of the same pour look different, it has no means to assign that difference. Is the second pour slower, or filmed at a lower frame rate? Is the second kitchen darker, or did a different device's auto-exposure decide it was? The task signal and the capture noise are entangled in the same pixels, and training cannot separate them, because the information required to do so was never recorded.

Uncontrolled capture does not add diversity; it adds a confound, and the model spends much of its capacity learning the camera fleet rather than the task.

Motionstack

This is the central difficulty. An ad hoc dataset can appear rich, often with more apparent variety than a controlled one. But variety that cannot be attributed is not coverage; it behaves as noise. The model may overfit to capture artefacts that happen to correlate with something incidental, or average them out and lose real signal in the process. A standardised rig is what allows diversity to carry its intended meaning.

Egocentric first-person video frame captured on a head-mounted rig during a household task
Every contributor, every city, the same viewpoint geometry and optics. The view changes because the world changes, not because the equipment did.

Calibration: the component most often overlooked

Standardising the hardware is the visible half. The less visible half is calibration, and it is where uncontrolled datasets fail under close inspection.

Calibration

Every camera has intrinsics (focal length, principal point, lens distortion), and lifting 2D pixels toward 3D motion requires those parameters to be known rather than assumed. A calibrated rig ships its intrinsics with the data, and they are identical across the fleet because the optics are identical. Calibration is also what gives the derived motion metric meaning: with known intrinsics, the camera ego-motion and 3D hand pose recovered from the video carry real scale rather than arbitrary units.

Motion without a second clock

Sensor rigs carry a quiet failure mode here. Video and an inertial unit are two streams sampled by two clocks, and clocks drift; a motion stream even tens of milliseconds out of step teaches a model that a hand reaches slightly before, or slightly after, it actually does, and at manipulation speeds tens of milliseconds separate a grasp from a miss. We avoid the problem at its source. Our motion is derived from the frames themselves (camera ego-motion, and 3D hand pose via HaMeR to MANO), so it is aligned to the video by construction. There is no second clock to drift and no offset to measure. Reference efforts such as Ego4D and its multi-view successor EgoExo4D treat alignment and calibration as first-class deliverables for the same reason: alignment must be a known quantity, not an assumption.

Egocentric video frame with its frame-aligned, video-derived motion track
Motion is recovered from the calibrated frames themselves, so it aligns with the video by construction rather than depending on a second sensor clock the model must trust.

How standardisation makes evaluation more honest

This is the argument that most often reframes the question. Standardisation is not only a training concern; it is also what makes generalisation measurable at all. A held-out test exists to vary the content of interest while holding everything else fixed, so that when performance drops the cause is identifiable.

When capture is standardised, a programme can hold out an entire city, an entire body type, or a class of homes, and ask a clean question: does a policy trained on the remainder still perform here? Because the rig is consistent across the split, a performance gap indicates that the policy failed to generalise across the content axis that was held out. That is actionable information.

When capture is not standardised, the same test reveals far less. The held-out city was also filmed on different devices, so a drop could reflect the city or the cameras, and the two cannot be separated. The likely outcome is a model that scored well on an in-distribution split and then fails in a real home, with no explanation available. Honest evaluation depends on isolating one variable at a time, and a standardised rig is what provides that control. Field-spread operator datasets such as DROID rely on a fixed capture setup precisely so that scene and task variation can be measured against a stable baseline.

The tension, and how it resolves

The apparent contradiction is now explicit. We argue for diversity while also requiring a great deal of sameness. The two seem to demand a choice.

The contradiction is an artefact of an imprecise term. Sharpen the word diversity and the tension dissolves. Any clip contains two kinds of variation: variation in what is happening, and variation in how it was recorded. The first is signal: the place, the person, the object, the manner in which the task succeeds or fails. The second is nuisance: the lens, the clock, the exposure curve. The resolution follows directly:

Maximise variation in the content and minimise variation in the capture, so a dataset is diverse where it matters and consistent where it should be.

Motionstack

A standardised rig is not the enemy of diversity; it is the instrument that makes diversity legible. Send the same controlled rig into a thousand different lives and the result is a dataset that varies widely in what the model should learn and remains constant in what it can safely ignore. That combination is the asset. Either half in isolation is a liability.

Precision about which axis each decision occupies is essential. Place, person, object, lighting as a property of the room, and the manner in which a task succeeds or fails are content, and each is pushed as wide as the brief allows. Camera position, lens, resolution, frame rate, exposure policy, and codec are capture, and each is held fixed. The line between the two columns is much of the design, and a dataset blurs precisely where the team blurs it.

1 rig
identical optics, frame rate, and calibration across the fleet
0 clocks
motion derived from the video, aligned by construction
1 variable
what evaluation isolates when capture is held constant

Why this also makes labelling and multi-clip learning tractable

The benefit extends well beyond the camera. Standardised capture is what keeps the work downstream of capture affordable rather than bespoke.

Labelling is the first beneficiary. When every clip shares the same field of view, the same frame rate, and motion derived on the same known geometry, annotation tooling and any auto-labelling can assume a fixed geometry. A grasp event has comparable temporal resolution throughout. An object enters frame at a comparable scale throughout. Annotators spend less effort re-learning each clip's idiosyncrasies, automated passes are less likely to fail on format drift, and a consistent action taxonomy maps cleanly onto data that does not shift beneath it.

Multi-clip learning is the larger prize. Models that learn across many demonstrations, whether by pooling, co-training, or batching diverse captures together, depend on those captures being commensurable. The community datasets that serve well as foundations, from the egocentric corpora to pooled efforts such as Open X-Embodiment, are those in which clips can be stacked because their capture is consistent. A standardised rig is what allows a thousand contributors' clips to function as one dataset rather than a thousand incompatible ones. That is the difference between data that scales and data that can only be collected.

What to ask a data partner

When commissioning real-world data, the rig questions separate serious suppliers from brokers. Ask them to state the field of view and frame rate in numbers, and to confirm both are identical across the entire fleet. Ask whether motion is derived from the video or bolted on from a separate sensor, and how it stays aligned to the frames. Ask for the calibration that ships with the data. Ask whether you can hold out a city and still trust the comparison. If the answers remain vague, the dataset is ad hoc capture behind a polished brief, and the model will surface that in production. Our approach is set out across solutions.

If you are building a policy and need capture you can genuinely evaluate against, tell us the spec. We field diverse people in diverse places on one standardised rig: varied where it matters, consistent where it should be, synchronised and calibrated so each clip remains comparable.

Get the real-world data your robot needs.

Tell us the task, the person, and the place. We field it from a network of 800k contributors and deliver it to spec, cleared for commercial training, in about four weeks.