Motivation
Dual-Arm
Break Banana (Single-Arm)
Break Banana (Dual-Arm)
Pull Piston (Single-Arm)
Pull Piston (Dual-Arm)
Push Piston (Single-Arm)
Push Piston (Dual-Arm)
Dual-Hand
Fetch Book (Gripper)
Fetch Book (Hand)
Flip Book (Gripper)
Flip Book (Hand)
Roll Dough (Gripper)
Roll Dough (Hand)
Use Pen (Gripper)
Use Pen (Hand)
Dexterous
Twist Cap (6-DoF Hand)
Twist Cap (12-DoF Hand)
Teleoperation System
The operator teleoperates the physical robot and its MuJoCo digital twin through the same interface, so demonstrations of the apple-to-plate task are collected in both the real world and simulation, which reduces the sim-to-real gap.
Low-latency teleoperation demonstration: In this video, the human operator performs fast, abrupt motions to test the system's end-to-end responsiveness, and the robot tracks the operator's movements in real time. Quantitative analysis shows an average end-to-end delay of 11 ms, which keeps manipulation responsive and precise even under aggressive teleoperation.
Dataset
Simulation
297 objects
30 categories
200 tasks
100K trajectories
6.5M frames
361h video
Open Source
LeRobot v2.1 format
Real World
347 objects
17 categories
200 tasks
10K trajectories
3.2M frames
177.5h video
Open Source
LeRobot v2.1 format
Method
(a) Data filtering: From the real-world dataset we pre-screen demonstrations by kinematic smoothness (low acceleration and jerk), then replay them for post-validation and keep the clips that complete the task without collisions, forming a high-quality subset.
(b) Discriminator training: With the pretrained diffusion-transformer policy frozen, we compute a log-π proxy for each clip and train a discriminator that, conditioned on observations and language, outputs a quality score d(C_t) ∈ (0,1].
(c) Data-quality-aware post-training: During post-training, the score d(C_t) is converted to weights w_i and used in the diffusion loss L_π. At inference time, only the policy is used.
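The kinematic pre-screen in step (a) can be illustrated with a minimal sketch. It assumes a trajectory is a list of joint-position rows sampled at a fixed interval, estimates acceleration and jerk by finite differences, and keeps a clip only if both stay under a limit. The function names, the use of peak magnitudes, and the threshold parameters are hypothetical choices for illustration; the page does not state the actual smoothness criterion.

```python
from typing import List, Sequence

# Assumed layout: one row of joint positions per time step.
Trajectory = Sequence[Sequence[float]]


def finite_difference(series: Trajectory, dt: float) -> List[List[float]]:
    """Approximate the time derivative of each joint by forward differences."""
    return [
        [(later - earlier) / dt for earlier, later in zip(prev_row, next_row)]
        for prev_row, next_row in zip(series, series[1:])
    ]


def peak_magnitude(series: Trajectory) -> float:
    """Largest absolute value over all joints and time steps (0.0 if empty)."""
    return max((abs(value) for row in series for value in row), default=0.0)


def is_smooth(positions: Trajectory, dt: float,
              max_accel: float, max_jerk: float) -> bool:
    """Hypothetical pre-screen: accept a clip with low peak acceleration and jerk."""
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return (peak_magnitude(acceleration) <= max_accel
            and peak_magnitude(jerk) <= max_jerk)
```

A constant-velocity clip passes this check, while a clip that oscillates back and forth between two positions is rejected. In the pipeline described above, clips that pass would still go through replay-based post-validation before entering the high-quality subset.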
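The weighting in step (c) can be sketched in the same spirit. The page does not specify how quality scores become weights, so the sketch below assumes a simple mean-one normalization, w_i = d_i / mean(d), and applies the weights to per-sample diffusion losses that are taken as given. Both function names and the normalization are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def scores_to_weights(scores: Sequence[float]) -> List[float]:
    """Assumed conversion: scale scores in (0, 1] so the weights average to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(not (0.0 < s <= 1.0) for s in scores):
        raise ValueError("each quality score must lie in (0, 1]")
    mean_score = sum(scores) / len(scores)
    return [s / mean_score for s in scores]


def weighted_loss(per_sample_losses: Sequence[float],
                  weights: Sequence[float]) -> float:
    """Batch loss where each sample's diffusion loss is scaled by its weight."""
    if len(per_sample_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / len(weights)
```

With this normalization, a batch of equally scored clips reduces to the ordinary unweighted mean loss, and higher-scored clips contribute proportionally more to the gradient. Because the discriminator only shapes the training loss, nothing here is needed at inference time, which matches the statement that only the policy is used.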
Basic Tasks Demonstration
Pick and Place
Place Apple on Plate
Place Bowl into Bowl
Put Two Eggs into Box (Bimanual)
Lift Basket (Bimanual)
Move Block to Right Plate (Bimanual)
Assemble/Disassemble
Stack Ring Blocks
Grab Square Blocks
Place Kettle on Base
Remove Pen Cap (Bimanual)
Separate Nested Bowls (Bimanual)
Articulated Object
Open Cabinet Door
Open Laptop (Bimanual)
Extra Tasks
Dexterous manipulation sequences
Cut Leek
Use Pen
Roll Dough
Twist Cap
Fetch Book
Place Plates
Out-of-Distribution Generalization
Unseen Background
Unseen Lighting
Unseen Object
Occlusion
Clutter
Height Change
Cross-embodiment Generalization
Single Gripper
Single Hand
Dual Gripper
Effect of the Data-Quality Discriminator
w/o Discriminator
Shaking; Fail
w/ Discriminator
Smooth; Succeed
Joint Curve
Analysis
w/o Discriminator
Shaking; Fail
w/ Discriminator
Smooth; Succeed
Joint Curve
Analysis
Failure Cases
Semantic Misgrounding due to Instruction Ambiguity
The language-conditioned policy misinterprets the instruction “put the apple into the bowl”, confusing the referent (“small bowl” vs. “big bowl”), and thus selects the incorrect target for placement.
Visual Misclassification under Object Similarity
The vision encoder, when executing the instruction “put the red apple into the bowl”, confuses visually similar red objects (apple vs. tomato) under mixed lighting, causing a target identification error.
Grasp Instability
The grasp lacks sufficient frictional stability or normal force, leading to slippage.
Bimanual Coordination Failure
Desynchronized joint trajectories between the two manipulators result in torque imbalance and object tilt.
Pose Drift and Accumulated Alignment Error
A small initial orientation deviation propagates across sequential steps, causing final assembly failure.
Workspace Boundary Violation
The target lies beyond the feasible workspace, leading to joint saturation.