Engineering · April 8, 2026 · 11 min read

Bimanual manipulation requires bimanual training data

Everyday household tasks are predominantly two-handed: one hand stabilises while the other acts, objects are handed off, and relative timing is decisive. Single-arm datasets fail to represent this structure.

By Motionstack

Two hands, real data

Consider almost any kitchen task and count the hands. One hand holds the jar while the other twists the lid. One hand pins the onion while the other draws the knife. One hand pulls the fabric taut so the other can fold a clean crease. The tasks we most want a domestic robot to perform are overwhelmingly two-handed, and they are two-handed in a specific way: the hands rarely perform identical actions. They occupy different roles, simultaneously, with a timing that must align.

Most manipulation data fails to represent this. The majority of what is available today is single-arm teleoperation or third-person video, and both flatten the part of the task that matters most. A policy trained on such data can perform well on a benchmark and still fail the moment a task requires one hand to prepare what the other executes. The limitation here is a data gap, not a model gap.

This article examines why bimanual manipulation is structurally harder than single-arm work in ways that are easily overlooked, why conventional data sources discard the coordinated signal, what bimanual data must contain to be useful, and how to source it for humanoids that ship with two arms precisely because the built environment is designed around two hands.

What makes bimanual manipulation hard

Two arms is not twice one arm. The difficulty does not add; it couples. As soon as both hands contact the same object, or two objects that interact, the control problem changes character. Several constraints intensify at once.

Role asymmetry

In most real two-handed tasks the hands are not interchangeable. One hand stabilises while the other acts: gripping the bowl while stirring, holding the bag open while scooping. The stabilising hand performs a continuous and subtle function, applying sufficient force to hold the object still without crushing it and adjusting as the acting hand exerts reaction forces. That hand is seldom the focus of a demonstration and almost never the focus of an edited video, yet its trajectory constitutes roughly half the task.

Coordination and timing

The two hands share a clock. A handoff defines a window in which both hands hold the object at once. A two-handed carry requires the grips to release and re-grip in sequence so the object never falls. Chopping while holding requires the holding hand to retreat a finger-width ahead of each cut. When a dataset records each arm as an independent stream without tight synchronisation, the single most important variable, the relative timing of the two hands, is precisely what is lost.
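As a rough illustration of why unsynchronised streams lose this variable, the sketch below works through a handoff recorded on two free-running clocks. All of the numbers (the grasp and release times, the 500 ppm drift, the two-minute recording) are hypothetical values chosen for the example, and the function names are not taken from any particular capture stack.

```python
def handoff_overlap_s(receiver_grasp_t: float, giver_release_t: float) -> float:
    """Seconds during which both hands hold the object.

    A negative value means the labels claim nobody was holding it.
    """
    return giver_release_t - receiver_grasp_t


def skew_after(elapsed_s: float, drift_ppm: float) -> float:
    """Offset (seconds) accumulated between two clocks that drift apart."""
    return elapsed_s * drift_ppm / 1_000_000


# Hypothetical ground truth: the receiving hand grasps at 10.00 s and the
# giving hand releases at 10.05 s, so both hands hold for about 50 ms.
true_overlap = handoff_overlap_s(10.00, 10.05)

# Hypothetical drift: the receiving arm's clock has gained 500 ppm over 120 s.
skew = skew_after(120.0, 500.0)

# The same handoff as it appears once the receiver's timestamps are skewed.
apparent_overlap = handoff_overlap_s(10.00 + skew, 10.05)

print(f"true overlap:     {true_overlap * 1000:.0f} ms")
print(f"clock skew:       {skew * 1000:.0f} ms")
print(f"apparent overlap: {apparent_overlap * 1000:.0f} ms")
```

Under these assumed numbers the skew is larger than the handoff window itself, so the labels would describe an object that was released before it was grasped.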

Contact and force

Bimanual tasks are characterised by closed kinematic chains: both hands and the object form a loop, and forces propagate around it. Opening a jar requires the two hands to work against each other in a controlled fashion. Grip the lid too tightly and it binds; grip it too loosely and the hand slips. This is force coordination, and most of it is invisible to a camera. It cannot be recovered from pixels, and a single-arm rig has no second hand from which to read it.

Deformables

A further class of objects changes shape under manipulation: cloth, dough, cables, bags, and packaging. These are almost always handled with two hands, because a single hand can rarely constrain a deformable object on its own. One hand defines a reference point while the other acts relative to it. Deformables are where single-arm policies degrade most rapidly, and they occupy the long tail of household tasks for which assistance is most valuable.

Single-arm data teaches a policy what one hand does. Bimanual data teaches it what two hands do to each other. These are distinct objectives, and the second is the one that folds the laundry.

Motionstack

The tasks that expose the gap

Coordination is easy to invoke in the abstract; it is more instructive to name the tasks, because each stresses a different aspect of the problem. The following reliably expose the limits of single-arm policies:

  • Folding laundry. A deformable object with no fixed shape. One hand pins a corner and maintains tension while the other sweeps a fold. Incorrect timing causes the fabric to bunch. There is no rigid pose to track, only the relationship between the two hands.
  • Opening a jar. A closed kinematic chain under load. One hand grips the body while the other applies torque to the lid, and the grip must tighten as resistance rises. The task is governed by force coordination, little of which is visible to a camera.
  • Chopping while holding. Role asymmetry with a moving deadline. The holding hand retreats incrementally to stay ahead of the blade. The two hands are tightly coupled in time and in space, and an error carries real consequence.
  • Two-handed carrying. Distributed load and sequenced re-grips. A tray, a full pot, a large box. Both grips share the weight and must transfer it without dropping the object, redistributing the load as the body turns.
  • Assembly. Hold, align, insert, fasten. One hand positions a part while the other mates a second part into it, frequently with sub-centimetre alignment. This is the canonical two-handed pattern in manufacturing and household repair.

Each task shares the same structure: a stabilising role, an acting role, and a tight temporal link between them. A dataset that does not capture both hands in synchrony, with their forces and their handoffs, cannot teach any of these tasks. At best it teaches the half of the task the camera happened to frame.

[Figure: isometric cutaway of a person folding laundry with both hands, one holding tension while the other folds.] Folding is the clearest example: there is no rigid pose to track, only the relationship between a hand that holds tension and a hand that folds.

Why conventional data misses the signal

If two-handed coordination is so central, why is so little of it present in the datasets? Because the two cheapest methods of gathering manipulation data both discard it.

Third-person video, the kind that can be scraped at scale, is recorded for an audience. It is edited, the framing follows the active hand, and the stabilising hand is frequently out of frame or out of focus. There is no synchronised motion, no force, no depth, and the cuts break continuity precisely where a handoff occurs. The footage establishes that a jar was opened but does not reveal how the two hands shared the torque.

Single-arm teleoperation is the other common default, and its limitation is structural rather than budgetary. A one-armed rig can only collect one-armed tasks. Operators select tasks the rig can perform, so the dataset drifts toward problems solvable with a single hand. The genuinely two-handed tasks, where much of the value resides, are filtered out by the hardware before any labelling begins.

Even bimanual teleoperation, which is closer to the correct form, narrows the distribution in its own way. Operators are slower and more deliberate than people performing their own chores, they favour tasks the interface makes comfortable, and a single laboratory's setups cover a thin slice of the world. It is genuine coordinated data, which is a meaningful advantage, but it does not span the full range of how ordinary people use two hands in ordinary settings. That breadth requires real-world human capture.

What good bimanual data captures

The standard should be set by the task itself. A bimanual dataset earns the name only if it records the properties that make the task bimanual in the first place.

First, both hands, aligned to the same frames. The relative timing of the hands carries much of the signal, so both must be read against the same video timeline rather than against two independent sensor clocks that can drift apart. When the motion is derived from the frames themselves, the two hands and the pixels share one timeline by construction, and a handoff stays as well-formed in the labels as it was in the world.

Second, the handoffs, kept intact. The brief window in which both hands hold the same object is the highest-information moment in the clip, and it is the first casualty of an edit or a dropped frame. Capture must keep that window continuous: no cut, no gap, both hands present and tracked through the transfer.
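One way to make "continuous" checkable is sketched below. It assumes each frame carries a hypothetical per-hand holding flag (how that flag is produced is outside the sketch) and that frames arrive sorted by index. It finds every run of frames in which both hands hold the object and reports whether any frame index is missing inside that run.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    index: int           # frame number in the source video
    left_holding: bool   # hypothetical label: left hand is holding the object
    right_holding: bool  # hypothetical label: right hand is holding the object


def handoff_windows(frames: List[Frame]) -> List[Tuple[int, int, bool]]:
    """Return (first_index, last_index, continuous) per both-hands-holding run.

    `continuous` is False if a frame index is skipped inside the run,
    which is what a dropped frame or an edit would look like.
    """
    windows: List[Tuple[int, int, bool]] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    continuous = True
    for f in frames:
        if f.left_holding and f.right_holding:
            if start is None:
                start, continuous = f.index, True
            elif f.index != prev + 1:
                continuous = False
            prev = f.index
        elif start is not None:
            windows.append((start, prev, continuous))
            start = prev = None
    if start is not None:
        windows.append((start, prev, continuous))
    return windows


# Hypothetical clip: one clean handoff, then one with frame 5 missing.
clip = [
    Frame(0, True, False),
    Frame(1, True, True),
    Frame(2, True, True),
    Frame(3, False, True),
    Frame(4, True, True),
    Frame(6, True, True),
]
print(handoff_windows(clip))
```

A take whose windows are not all continuous would be a candidate for re-capture rather than repair, since the missing frames sit exactly where the information is densest.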

Third, the stabilising hand, captured in full. Not only the hand that draws attention, but the quieter one maintaining tension, applying counter-force, and making micro-adjustments. Its trajectory and, where possible, its contact and force constitute half the demonstration. A capture setup that foregrounds the acting hand records half a task and treats it as complete.

Fourth, consistent capture throughout. Motion for both hands, video, and depth, recorded on a standardised rig and framed so that neither hand leaves the field of view at the moment it matters. Diversity of person, place, and task adds value, but only once the capture standard reliably holds both hands in frame and in synchrony on every take.
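A capture standard of this kind can be expressed as a per-take acceptance check. The sketch below is a minimal version under stated assumptions: per-frame visibility flags for each hand are taken as given, and the 0.98 threshold is an arbitrary placeholder rather than a recommended value.

```python
from typing import Sequence


def both_hands_coverage(left_visible: Sequence[bool],
                        right_visible: Sequence[bool]) -> float:
    """Fraction of frames in which both hands are inside the field of view."""
    if len(left_visible) != len(right_visible):
        raise ValueError("visibility flags must cover the same frames")
    if not left_visible:
        return 0.0
    both = sum(1 for l, r in zip(left_visible, right_visible) if l and r)
    return both / len(left_visible)


def take_passes(left_visible: Sequence[bool],
                right_visible: Sequence[bool],
                min_coverage: float = 0.98) -> bool:
    """Accept a take only if both hands are in frame often enough.

    min_coverage is a hypothetical threshold, chosen for illustration.
    """
    return both_hands_coverage(left_visible, right_visible) >= min_coverage


# Hypothetical 10-frame take in which the right hand leaves frame once.
left = [True] * 10
right = [True] * 9 + [False]
print(both_hands_coverage(left, right), take_passes(left, right))
```

A stricter variant would require full coverage inside the handoff and contact windows and apply the looser threshold only elsewhere in the take.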

  • 2 hands: synchronised on one clock so relative timing survives
  • 1 chain: hands plus object form a closed loop, forces flow around it
  • 5 patterns: fold, open, chop-while-hold, carry, assemble

Why this matters for humanoids

The rationale for the humanoid form factor is that the built environment is already designed around a body with two arms and two hands, so a robot of that morphology can integrate into the world without rebuilding it. Doorknobs, jars, drawers, and laundry all assume human hands, and many of the tasks built around them assume two. That premise holds only if the policy driving the two arms has learned to use them in concert.

This is where the form factor and the data must agree. Ship a robot with two arms but train it on one-armed data, and the result is hardware capable of coordination paired with a policy that never learned it. Such a robot solves the single-arm slice of the task space and stalls on everything requiring a stabilising hand. The promise of the humanoid, that it can take on genuinely two-handed chores, is as much a promise about the training data as about the joints.

The research that has advanced real bimanual skill has consistently targeted coordinated capture. ALOHA and Mobile ALOHA built a low-cost bimanual teleoperation setup specifically to collect synchronised two-arm demonstrations and learn fine, contact-rich tasks. DROID distributed manipulation collection across many scenes to pursue robustness rather than a single curated laboratory. Open X-Embodiment demonstrated that pooling diverse manipulation data across robots and setups can transfer better than training on a single narrow source. Ego4D showed that first-person video of ordinary people can capture, at scale, the natural two-handed behaviour that staged collection flattens. The common thread is that coordination must be present in the data, not assumed by the model.

How to source it

When scoping a bimanual dataset, write the specification around the coordination rather than the task name. "Fold a shirt" is insufficient; the deliverable that matters defines what must be true of both hands. A useful specification states:

  1. The task in concrete two-handed terms, including which hand stabilises and which acts, and the handoffs of interest.
  2. Both hands' motion recovered per frame from the video, so the two hands and the pixels stay aligned by construction.
  3. The handoff and contact windows kept continuous, with no cuts or dropped frames where the hands meet.
  4. The deformable and force-coupled tasks required, since these are where single-arm policies fail first.
  5. The person and place distribution, the capture standard that keeps both hands in frame throughout, and the modalities to be synchronised.
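The five items above can be written down as a structured record so that a vague request is caught before collection starts. The sketch below is one possible shape, not a standard schema: the class name, field names, and messages are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BimanualSpec:
    task: str
    stabilising_hand: str = ""            # item 1: which hand stabilises
    acting_hand: str = ""                 # item 1: which hand acts
    handoffs: List[str] = field(default_factory=list)  # item 1: handoffs of interest
    per_frame_hand_motion: bool = False   # item 2: both hands recovered per frame
    continuous_windows: bool = False      # item 3: no cuts or drops at contact
    hard_tasks: List[str] = field(default_factory=list)  # item 4: deformable, force-coupled
    people_and_places: str = ""           # item 5: person and place distribution
    modalities: List[str] = field(default_factory=list)  # item 5: what is synchronised


def spec_gaps(spec: BimanualSpec) -> List[str]:
    """List the parts of the specification that are still unstated."""
    gaps = []
    if not spec.stabilising_hand or not spec.acting_hand:
        gaps.append("roles: name the stabilising hand and the acting hand")
    if not spec.per_frame_hand_motion:
        gaps.append("motion: require both hands recovered per frame")
    if not spec.continuous_windows:
        gaps.append("continuity: require unbroken handoff and contact windows")
    if not spec.hard_tasks:
        gaps.append("coverage: list the deformable and force-coupled tasks")
    if not spec.people_and_places:
        gaps.append("distribution: describe the people and places to cover")
    if not spec.modalities:
        gaps.append("modalities: list what must be synchronised")
    return gaps


vague = BimanualSpec(task="Fold a shirt")
complete = BimanualSpec(
    task="Fold a shirt on a table",
    stabilising_hand="left",
    acting_hand="right",
    handoffs=["corner transfer after first fold"],
    per_frame_hand_motion=True,
    continuous_windows=True,
    hard_tasks=["cloth folding"],
    people_and_places="varied adults, home laundry areas",
    modalities=["video", "depth", "hand motion"],
)
print(len(spec_gaps(vague)), len(spec_gaps(complete)))
```

Here the task name alone leaves every coordination requirement open, while the filled-in record leaves none.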

Once coordination is in the specification, the definition of completion changes. The deliverable is no longer a count of clips but a set of demonstrations in which both hands are present, synchronised, and continuous through every handoff. That is the data from which a two-armed robot can learn two-handed tasks. Review what we capture across the full dataset library, or see how we standardise capture in our solutions.

If you are training a manipulation policy for a two-armed robot and it stalls the moment a task requires both hands, tell us the spec. We field real people performing real two-handed chores, in real homes, on a rig that holds both hands in frame and in synchrony, with the handoffs intact and the rights cleared for commercial training.
