Engineering · May 14, 2026 · 10 min read

Teleoperation vs Human Video for Robot Manipulation Data

Teleoperation yields exact, on-embodiment action labels but is costly to scale; human video scales readily but leaves actions to be recovered. The two are complementary, and most production data programs combine them.

By Motionstack

Teleop or human video?

Manipulation training data overwhelmingly originates from one of two sources. In the first, a human drives the robot directly, frame by frame, and the system records exactly what the robot did. In the second, a human performs the task with their own hands, the activity is filmed, and the action the robot should have taken is reconstructed afterward. The former is teleoperation; the latter is human video. They occupy opposite ends of a single tradeoff, and selecting between them is among the most consequential decisions in a manipulation data program.

The temptation is to treat this as a binary choice. Advocates of teleoperation note that its action labels are exact and on-embodiment, so the policy learns from ground truth. Advocates of human video note that their pipelines scale to millions of clips while teleoperation remains gated by robot availability. Both claims are correct, and that is precisely the point: the two sources excel at different things, and framing the decision as strictly one-or-the-other forfeits the gains available from combining them.

This post sets out the tradeoff plainly: where teleoperation leads, where human video leads, what the embodiment gap costs in practice, and why the pragmatic answer is typically a hybrid that uses diverse human video for breadth and targeted teleoperation for the last mile.

Teleoperation: exact actions, slower and narrower

In teleoperation, an operator controls the robot through a leader arm, a VR rig, a 3D mouse, or a glove, and the robot's own joints and grippers record the action at every timestep. The result is paired in exactly the form imitation learning requires: this observation, this action, on this exact body. There is no inference step, no morphology mismatch, and no guesswork about contact dynamics. The robot executed the behavior, and the system recorded what it executed.
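As a rough illustration of what "paired" means here, the sketch below stores one record per timestep and yields (observation, action) tuples. The schema, field names, and example values are hypothetical and are not taken from DROID, BridgeData V2, or any other dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TeleopStep:
    """One timestep of a teleoperated demonstration (hypothetical schema)."""
    timestamp_s: float
    image_path: str                     # observation: camera frame on disk
    joint_positions: List[float]        # observation: measured joint angles (rad)
    action_joint_targets: List[float]   # action: commanded joint targets (rad)
    gripper_command: float              # action: 0.0 = open, 1.0 = closed


@dataclass
class TeleopEpisode:
    task: str
    robot_id: str
    steps: List[TeleopStep] = field(default_factory=list)

    def pairs(self):
        """Yield (observation, action) tuples, one per recorded timestep."""
        for s in self.steps:
            obs = (s.image_path, tuple(s.joint_positions))
            act = (tuple(s.action_joint_targets), s.gripper_command)
            yield obs, act


# Illustrative values only.
episode = TeleopEpisode(task="pick_mug", robot_id="arm_01")
episode.steps.append(TeleopStep(0.00, "f0000.jpg", [0.0, 0.5], [0.0, 0.52], 0.0))
episode.steps.append(TeleopStep(0.05, "f0001.jpg", [0.0, 0.52], [0.01, 0.55], 1.0))
pairs = list(episode.pairs())
```

Because the action fields come from the robot's own log, nothing in this record has to be estimated after the fact.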

This fidelity is why teleoperation underpins many of the strongest open manipulation datasets. DROID collected tens of thousands of teleoperated trajectories across many labs and scenes on a shared hardware configuration. BridgeData V2 assembled a broad set of teleoperated kitchen and tabletop tasks specifically so that policies could be trained and benchmarked on consistent, on-embodiment demonstrations. The Open X-Embodiment pool, which aggregates data from many robots and labs, is overwhelmingly teleoperated. When the field requires ground truth, it teleoperates.

The cost surfaces everywhere else. Teleoperation requires a robot, a functioning cell, a trained operator, and time. An operator collects demonstrations in real time, one body at a time, and contact-rich tasks fail and reset frequently. Throughput is capped by the number of robots and operators that can run concurrently, which makes diversity expensive: each new home, lighting condition, or object distribution requires physically relocating a rig or rebuilding a scene. The outcome is pristine labels over a narrow slice of the world.

What teleoperation bakes into the data

A subtler property deserves attention. Teleoperated motion is not how a human would naturally perform the task. It is how a human performs the task through a robot's kinematics, latency, and gripper, frequently with hesitation and mid-trajectory correction. The policy learns that mediated style. This is acceptable precisely because it is on-embodiment, but it means teleoperation does not capture natural human strategy. It captures a human's best attempt to puppet a machine.

[Image: synchronized motion track recorded alongside a manipulation task]
Teleoperation records the robot's own joint and gripper trajectory directly, so the action label is exact and requires no inference.

Human video: abundant, diverse, and unlabeled

Human video inverts most of these properties. A person wears or faces a camera and performs the task with their own hands, at their own pace, in their own home. No robot is involved, so capture scales the way filming scales: in parallel, across thousands of people and locations, at a fraction of the cost per hour. The result captures the long tail of how real people perform real tasks, including the natural variation in grip, sequencing, recovery, and improvisation that teleoperation rarely observes.

The research community has invested heavily here for the same reasons. Ego4D assembled thousands of hours of first-person daily activity across hundreds of participants and many countries, a breadth few teleoperation programs could match. Ego-Exo4D extended this by pairing egocentric and exocentric views of skilled activity, which supports studying how hand and body motion can be recovered from video. The underlying premise is that broad human video carries information teleoperation cannot, because it depicts the task as humans actually solve it, across many environments, at low cost.

The constraint is contained in the word unlabeled. Human video does not arrive with actions; it arrives as pixels. Training a manipulation policy requires recovering a robot-usable action from each frame: the 3D position of the hand and fingers, their motion over time, and the moments contact begins and ends. This recovery is the central difficulty, and it determines whether the breadth strengthens or quietly weakens the dataset.
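To make the recovery targets concrete, here is a minimal sketch that turns an already-estimated 3D hand track into per-step velocities and contact intervals. It assumes an upstream estimator has produced the positions and per-frame contact flags; the function names and sample values are illustrative, not part of any specific pipeline.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def finite_difference_velocity(track: List[Point], dt: float) -> List[Point]:
    """Velocity between consecutive 3D positions sampled every dt seconds."""
    return [
        tuple((b[i] - a[i]) / dt for i in range(3))
        for a, b in zip(track, track[1:])
    ]


def contact_intervals(in_contact: List[bool]) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where contact is flagged."""
    intervals, start = [], None
    for i, flag in enumerate(in_contact):
        if flag and start is None:
            start = i                      # contact begins
        elif not flag and start is not None:
            intervals.append((start, i))   # contact ends
            start = None
    if start is not None:                  # contact still active at the last frame
        intervals.append((start, len(in_contact)))
    return intervals


# Illustrative wrist positions in meters, sampled every 0.5 s.
wrist_track = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (0.75, 0.0, 0.0)]
velocities = finite_difference_velocity(wrist_track, dt=0.5)
segments = contact_intervals([False, True, True, False, True])
```

Any noise in the estimated positions is amplified by the differencing step, which is one reason the quality of the upstream pose estimate matters so much.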

The embodiment gap, concretely

The core obstacle is the embodiment gap: a human hand is not a robot gripper, and a human arm is not a robot arm. Even given a near-perfect hand-pose estimate, the motion must be retargeted onto a body with different proportions, different degrees of freedom, a different finger count, and a different reachable workspace. A five-finger pinch has no clean mapping to a parallel-jaw gripper. A motion trivial for a human wrist can be difficult or infeasible for the robot's.

Retargeting is the step that closes this gap, and it is inherently lossy. The pipeline estimates the human hand pose from video, itself noisy under occlusion and motion blur, then solves for a robot configuration that reproduces the salient component of the motion, typically the end-effector path and the grasp. Contact forces are largely invisible in video and are therefore inferred or omitted. Each stage is an opportunity for a confident-looking label to be subtly incorrect, and a policy trained on subtly incorrect actions learns subtly incorrect behavior.
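A deliberately crude sketch of the grasp half of retargeting follows: it maps the thumb-to-index distance onto a parallel-jaw opening and discards frames whose pose estimate is too uncertain. The 0.08 m stroke, the 0.5 confidence threshold, and all function names are assumptions for illustration; a real pipeline would also solve for a reachable arm configuration.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float, float]


def pinch_to_gripper_width(thumb_tip: Point, index_tip: Point,
                           max_width_m: float = 0.08) -> float:
    """Map thumb-index distance to a jaw opening, clamped to the gripper stroke."""
    aperture = math.dist(thumb_tip, index_tip)
    return min(aperture, max_width_m)


def retarget_frame(thumb_tip: Point, index_tip: Point, confidence: float,
                   min_confidence: float = 0.5) -> Optional[dict]:
    """Return an approximate gripper target, or None for a low-confidence frame."""
    if confidence < min_confidence:
        return None  # better to drop the frame than emit a confident-looking wrong label
    center = tuple((t + i) / 2 for t, i in zip(thumb_tip, index_tip))
    return {
        "ee_position": center,  # midpoint of the pinch as the end-effector target
        "gripper_width_m": pinch_to_gripper_width(thumb_tip, index_tip),
    }


wide = retarget_frame((0.0, 0.0, 0.0), (0.25, 0.0, 0.0), confidence=0.9)
dropped = retarget_frame((0.0, 0.0, 0.0), (0.25, 0.0, 0.0), confidence=0.2)
```

The clamp is where the loss shows up: a hand opening wider than the gripper stroke collapses to the same label as one that exactly matches it.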

  • Morphology mismatch. Human fingers to robot gripper, human reach to robot reach. Some motions do not transfer cleanly and must be approximated or discarded.
  • Pose-estimation error. Recovering hand and body pose from monocular egocentric video is difficult, particularly under occlusion, fast motion, and atypical lighting.
  • Missing forces. Video records kinematics, not applied force or grip pressure. Force and contact must be inferred, and frequently are not.
  • Viewpoint and calibration. Without a known camera placement, scale and depth are ambiguous, which degrades 3D motion recovery.
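The scale and depth ambiguity in the last item can be shown with an idealized pinhole camera: a point and a copy of it twice as far away and twice as large project to the same pixel. This is a sketch under the simplifying assumption of a single focal length and no distortion; the focal length and coordinates are illustrative.

```python
def project(point, focal_px=512.0):
    """Idealized pinhole projection of a camera-frame point (x, y, z) to pixels."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


near_small = (0.125, 0.0625, 0.5)               # meters, 0.5 m from the camera
far_large = tuple(2.0 * c for c in near_small)  # same direction, twice the distance

pixel_near = project(near_small)
pixel_far = project(far_large)
same_pixel = pixel_near == pixel_far
```

From the image alone the two cases are indistinguishable, so recovering metric motion needs extra information such as calibration, a depth measurement, or a reference of known size.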

None of this renders human video unusable. It renders it raw. The breadth is genuine and valuable, but it pays off only when the capture is consistent enough for retargeting to succeed, which is the difference between varied footage and a genuinely varied dataset.

"Teleoperation answers a narrow question precisely; human video poses the right question across many environments and leaves the answer to be recovered." (Motionstack)

The tradeoff, side by side

The two sources are best compared along the four axes that govern a data program: fidelity of the action label, scale of collection, diversity of the distribution, and cost per useful hour. A further consideration, naturalness of strategy, follows from the same comparison.

  • Fidelity. Teleoperation leads. Actions are exact and on-embodiment. Human video must infer and retarget actions, yielding approximate labels.
  • Scale. Human video leads. Capture runs in parallel across people and locations, whereas teleoperation is gated by robot and operator availability.
  • Diversity. Human video leads. It naturally spans homes, body types, and authentic strategy, whereas each new environment in a teleoperation program requires physically moving a rig or rebuilding a scene.
  • Cost per useful hour. Mixed. Human video is far cheaper to capture but carries a downstream labeling and retargeting cost. Teleoperation is more expensive to capture but arrives usable.
  • Naturalness of strategy. Human video leads. It depicts how people genuinely perform tasks, whereas teleoperation depicts how people puppet a robot to approximate them.
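One way to reason about the "cost per useful hour" axis is to fold capture cost, downstream processing cost, and the fraction of footage that survives quality checks into a single number. The formula and parameter names below are an assumed simplification, and the example values are placeholders rather than measured costs for either source.

```python
def cost_per_useful_hour(capture_cost_per_hour: float,
                         processing_cost_per_hour: float,
                         usable_fraction: float) -> float:
    """Total spend per raw hour divided by the fraction of hours that end up usable."""
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return (capture_cost_per_hour + processing_cost_per_hour) / usable_fraction


# Placeholder inputs: 10 to capture, 5 to process, half of the hours usable.
example = cost_per_useful_hour(10.0, 5.0, 0.5)
```

Under this framing, cheap capture only wins if processing cost stays low and the usable fraction stays high, which is why the axis is listed as mixed.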

Read down the list and the conclusion is clear. Neither source dominates the other; they behave as complements. The operative question is rarely which to use, but how to combine them so that each covers the other's weakness.

2 sources: teleoperation and human video, opposite ends of one tradeoff
4 axes: fidelity, scale, diversity, and cost shape the mix
1 gap: embodiment mismatch is what retargeting works to close

Where each source fits

Teleoperation is the right fit when the last mile matters: precise, contact-rich, on-embodiment behavior where small action errors compound into failure. Threading a connector, pouring without spilling, the final centimeters of a grasp. These tasks require the action label to be exactly correct on the exact robot, and there is rarely a cheaper route to that than driving the robot and recording it.

Human video is the right fit when the bottleneck is coverage: the breadth of locations, people, objects, and natural strategies a policy must generalize across. This is the dimension teleoperation cannot affordably collect, and it often determines whether a policy holds up outside the lab. Broad, diverse human demonstration is how a model learns what is invariant about a task rather than memorizing a single cell. We examine why coverage tends to outweigh raw volume in a companion post; in short, variety across place, person, and task is the decisive lever.

The teams advancing generalist manipulation operate this way. Groups such as Physical Intelligence train on large, heterogeneous mixtures rather than betting on a single clean source, because the mixture is what produces policies that transfer. Diversity in the data does the heavy lifting, while precise on-embodiment data does the finishing.

[Image: first-person view of a person performing a cooking task in a home kitchen]
Human video captures natural strategy across real homes at a scale teleoperation rarely reaches, which is the principal source of coverage.

The hybrid is the pragmatic answer

The effective strategy is not to choose but to sequence. Diverse human video serves as the breadth layer: pretrain or co-train on a large, varied corpus of real people performing real tasks, so the policy absorbs scene diversity, object variety, and natural strategy at a scale teleoperation cannot fund. Targeted teleoperation, or on-robot fine-tuning, then serves the last mile: a focused set of exact, on-embodiment demonstrations on the specific tasks and the specific robot where fidelity determines success.

This split spends each budget where it returns the most. Human video is inexpensive per hour and buys coverage. Teleoperation is expensive per hour and buys precision. Placing breadth first and precision last avoids paying robot-cell rates to collect diversity and avoids asking a retargeted human hand to nail the final millimeters of a contact-rich grasp. Each source does the work it is genuinely suited to.
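For the co-training variant of this split, a minimal sketch is a weighted sampler that decides, per training example, which source to draw from. The source labels, the 0.75 weight, and the function name are assumptions for illustration; the right ratio depends on the task and is not something this post prescribes.

```python
import random
from typing import List


def sample_batch_sources(batch_size: int, human_video_weight: float,
                         rng: random.Random) -> List[str]:
    """Pick a data source for each example in one co-training batch."""
    if not 0.0 <= human_video_weight <= 1.0:
        raise ValueError("human_video_weight must be in [0, 1]")
    sources = ["human_video", "teleop"]
    weights = [human_video_weight, 1.0 - human_video_weight]
    return rng.choices(sources, weights=weights, k=batch_size)


rng = random.Random(0)  # seeded so the mixture is reproducible
batch = sample_batch_sources(8, human_video_weight=0.75, rng=rng)
breadth_only = sample_batch_sources(4, human_video_weight=1.0, rng=rng)
```

Lowering the weight over the course of training moves the same sampler from a breadth-heavy mixture toward on-robot fine-tuning, which is one way to express the breadth-first, precision-last ordering.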

The hybrid succeeds only when the human-video layer is built to be retargeted, not merely collected. In practice this requires consistent camera placement and calibration, motion recovered cleanly from the video itself via per-frame hand-pose and ego-motion estimation, and sufficient diversity in the dimensions that matter. Varied footage on arbitrary phones at arbitrary angles is closer to noise than to a breadth layer. The breadth layer must be diverse in content and consistent in capture simultaneously.
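A capture standard of this kind can be enforced as a simple gate on per-clip metadata. The sketch below is hypothetical throughout: the field names, the expected mount, and the 0.9 coverage threshold are illustrative choices, not a description of any particular rig.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipMeta:
    """Per-clip capture metadata (all field names are hypothetical)."""
    camera_mount: str           # e.g. "head" or "chest"
    has_intrinsics: bool        # camera calibration is stored with the clip
    hand_pose_coverage: float   # fraction of frames with a usable hand-pose estimate


def capture_issues(meta: ClipMeta, expected_mount: str = "head",
                   min_pose_coverage: float = 0.9) -> List[str]:
    """Return the reasons a clip fails the capture standard (empty if it passes)."""
    issues = []
    if meta.camera_mount != expected_mount:
        issues.append("unexpected camera mount")
    if not meta.has_intrinsics:
        issues.append("missing calibration")
    if meta.hand_pose_coverage < min_pose_coverage:
        issues.append("low hand-pose coverage")
    return issues


good = ClipMeta("head", True, 0.97)
bad = ClipMeta("handheld", False, 0.60)
good_issues = capture_issues(good)
bad_issues = capture_issues(bad)
```

Returning reasons rather than a bare pass or fail makes it easier to see which part of the capture standard is failing most often.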

Implications for your data spec

When scoping a manipulation program, decide the split deliberately. Identify the tasks where on-embodiment fidelity is non-negotiable and budget teleoperation for those. Specify the coverage required across homes, people, and object distributions, and source it as human video. Define the capture standard for that video so it retargets cleanly. Finally, decide in advance how you will reserve a held-out slice that is diverse enough to measure real generalization rather than in-distribution memorization.
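One simple way to reserve such a held-out slice is to split by group rather than by clip, so that every clip from a held-out home stays out of training. The sketch below assumes each clip carries a "home" identifier; the key names, the 20% fraction, and the synthetic clip list are illustrative.

```python
import random
from typing import Dict, List, Tuple


def split_by_group(clips: List[Dict], group_key: str, holdout_fraction: float,
                   seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Hold out whole groups (e.g. homes) so no group appears on both sides."""
    groups = sorted({c[group_key] for c in clips})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_holdout = max(1, round(len(groups) * holdout_fraction))
    held = set(groups[:n_holdout])
    train = [c for c in clips if c[group_key] not in held]
    heldout = [c for c in clips if c[group_key] in held]
    return train, heldout


# Synthetic example: 20 clips spread evenly across 5 homes.
clips = [{"clip_id": i, "home": f"home_{i % 5}"} for i in range(20)]
train, heldout = split_by_group(clips, "home", holdout_fraction=0.2)
train_homes = {c["home"] for c in train}
heldout_homes = {c["home"] for c in heldout}
```

A random per-clip split would leak every home into both sides and measure in-distribution memorization instead; the same function can group by person or object set if those are the dimensions that matter.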

Motionstack supplies the breadth layer: diverse, motion-synced human video from real people in real homes, captured on a consistent rig so it retargets cleanly and complements your targeted teleoperation rather than fighting it. If you are deciding how to allocate your manipulation data budget, tell us the spec and we can field the coverage your policy is missing.
