Dexora: Open-source VLA for High-DoF Bimanual Dexterity

We introduce Dexora, the first open-source vision-language-action (VLA) system that natively targets dual-arm, dual-hand, high-DoF manipulation. Trained with a hybrid teleoperation pipeline and a dataset spanning simulation and the real world, Dexora handles both basic and dexterous tasks. Our approach unifies dual-arm coordination and fine finger control, opening a path toward universal controllers across varying embodiments.

Motivation

Existing systems are fundamentally limited: they are designed either for dual-arm grippers or for single-arm dexterous hands, but not both. This prevents prior VLAs from handling tasks that require dual-arm coordination (e.g., piston insertion) or high-DoF dexterous fingers (e.g., bottle opening, complex book retrieval). Dexora is the first open-source VLA system to bridge this gap, unifying dual-arm, dual-hand, and high-DoF dexterity in a single framework and enabling a wide range of complex manipulation tasks.

Dual-Arm

×4

Break Banana (Single-Arm)

×4

Break Banana (Dual-Arm)

×4

Pull Piston (Single-Arm)

×4

Pull Piston (Dual-Arm)

×4

Push Piston (Single-Arm)

×4

Push Piston (Dual-Arm)

Dual-Hand

×4

Fetch Book (Gripper)

×4

Fetch Book (Hand)

×4

Flip Book (Gripper)

×4

Flip Book (Hand)

×4

Roll Dough (Gripper)

×4

Roll Dough (Hand)

×4

Use Pen (Gripper)

×4

Use Pen (Hand)

Dexterous

Twist Cap (6-DoF Hand)

×4

Twist Cap (12-DoF Hand)

Teleoperation System

The operator teleoperates the physical robot and its MuJoCo digital twin through the same interface, so apple-to-plate demonstrations are collected in both the real world and simulation, which reduces the sim-to-real gap.

×1

Low-latency teleoperation demonstration: The human operator performs fast, abrupt motions to test the system's end-to-end responsiveness, and the robot tracks these motions in real time. Quantitative analysis shows an average end-to-end delay of only 11 ms, which enables responsive and precise manipulation even under aggressive teleoperation.
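The page does not say how the 11 ms figure was measured. One common way to estimate such a delay, sketched here purely as an illustration, is to log one operator signal and the matching robot signal at a fixed rate and take the lag of their cross-correlation peak. The function name `estimate_delay_ms`, the 1 kHz logging rate, and the synthetic signals are all assumptions, not part of Dexora.

```python
import numpy as np

def estimate_delay_ms(operator, robot, sample_rate_hz):
    """Estimate how far `robot` lags `operator` from the cross-correlation peak."""
    # Remove the means so the peak reflects shape similarity, not offsets.
    x = np.asarray(operator, dtype=float) - np.mean(operator)
    y = np.asarray(robot, dtype=float) - np.mean(robot)
    corr = np.correlate(y, x, mode="full")
    # In "full" mode the lags run from -(len(x) - 1) to len(y) - 1.
    lags = np.arange(-(len(x) - 1), len(y))
    best_lag = lags[np.argmax(corr)]
    return 1000.0 * best_lag / sample_rate_hz

# Synthetic check (assumed data): the robot signal is the operator signal
# shifted by 11 samples, i.e. 11 ms at an assumed 1 kHz logging rate.
rng = np.random.default_rng(0)
operator = rng.standard_normal(2000)
delay_samples = 11
robot = np.concatenate([np.zeros(delay_samples), operator[:-delay_samples]])
delay_ms = estimate_delay_ms(operator, robot, sample_rate_hz=1000.0)
print(f"estimated delay: {delay_ms:.1f} ms")
```

The time resolution of this estimate is one sample period, so the logging rate must be well above the delay being measured.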

Dataset

Simulation

×2

297 objects

30 categories

200 tasks

100K trajectories

6.5M frames

361h video

Open Source

LeRobot v2.1 format

Real World

×2

347 objects

17 categories

200 tasks

10K trajectories

3.2M frames

177.5h video

Open Source

LeRobot v2.1 format
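As a quick sanity check on the two sets of figures, the short calculation below derives per-trajectory averages from the rounded totals listed above. It assumes that the frame counts and video hours describe the same streams, which the page does not state, so the results are rough estimates only. The helper name `per_trajectory_stats` is hypothetical.

```python
def per_trajectory_stats(frames, trajectories, video_hours):
    """Return (frames per trajectory, seconds per trajectory, frames per second)."""
    total_seconds = video_hours * 3600
    return (
        frames / trajectories,
        total_seconds / trajectories,
        frames / total_seconds,
    )

# Rounded totals copied from the dataset summary above.
sim = per_trajectory_stats(frames=6.5e6, trajectories=100_000, video_hours=361)
real = per_trajectory_stats(frames=3.2e6, trajectories=10_000, video_hours=177.5)

for name, (n_frames, seconds, fps) in (("simulation", sim), ("real world", real)):
    print(f"{name}: {n_frames:.0f} frames, {seconds:.1f} s per trajectory, {fps:.2f} frames/s")
```

Under that assumption, a simulated trajectory averages roughly 65 frames over 13 s and a real-world trajectory roughly 320 frames over 64 s. Both work out to about 5 frames per second of video.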

Method

(a) Data filtering: From the real-world dataset we pre-screen demonstrations by kinematic smoothness (low acceleration and jerk), then replay them for post-validation and keep the clips that complete the task without collisions, forming a high-quality subset.

(b) Discriminator training: With the pretrained diffusion–transformer policy frozen, we compute a log-π proxy for each clip and train a discriminator that, conditioned on observations and language, outputs a quality score d(C_t) ∈ (0,1].

(c) Data-quality-aware post-training: During post-training, the score d(C_t) is converted to weights w_i and used in the diffusion loss L_π. At inference time, only the policy is used.
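Steps (a) and (c) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The finite-difference smoothness metrics, the thresholds, the mean-one normalisation of the weights, and the noise-prediction MSE used as a stand-in for the diffusion loss L_π are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def smoothness_metrics(q, dt):
    """Peak |acceleration| and |jerk| of a joint trajectory q with shape (T, n_joints)."""
    acc = np.diff(q, n=2, axis=0) / dt**2   # second finite difference
    jerk = np.diff(q, n=3, axis=0) / dt**3  # third finite difference
    return np.abs(acc).max(), np.abs(jerk).max()

def prescreen(q, dt, max_acc, max_jerk):
    """Step (a), first half: keep a clip only if it is kinematically smooth."""
    acc, jerk = smoothness_metrics(q, dt)
    return bool(acc <= max_acc and jerk <= max_jerk)

def quality_weights(scores, eps=1e-8):
    """Step (c): map scores d(C_t) in (0, 1] to weights w_i with mean 1 (assumed scheme)."""
    scores = np.asarray(scores, dtype=float)
    return scores / (scores.mean() + eps)

def weighted_diffusion_loss(pred_noise, true_noise, weights):
    """Per-clip noise-prediction MSE, reweighted by data quality."""
    per_clip = ((pred_noise - true_noise) ** 2).reshape(len(weights), -1).mean(axis=1)
    return float((weights * per_clip).mean())

# Toy data: a smooth sine trajectory and a jittery copy of it.
dt = 0.01
t = np.arange(200)[:, None] * dt
smooth = np.sin(t)
jittery = smooth + 0.01 * np.random.default_rng(0).standard_normal(smooth.shape)

keep_smooth = prescreen(smooth, dt, max_acc=5.0, max_jerk=50.0)
keep_jittery = prescreen(jittery, dt, max_acc=5.0, max_jerk=50.0)

weights = quality_weights([0.2, 1.0])
loss = weighted_diffusion_loss(np.zeros((2, 4)), np.ones((2, 4)), weights)
```

With these toy thresholds the sine trajectory passes the pre-screen and the jittery copy is rejected. In the loss, the clip scored 0.2 contributes one fifth as much as the clip scored 1.0. The replay-based post-validation in (a) and the discriminator in (b) need a robot or simulator and a trained policy, so they are not sketched here.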

Basic Tasks Demonstration

Pick and Place

×2

Place Apple on Plate

×2

Place Bowl into Bowl

×2

Put Two Eggs into Box (Bimanual)

×2

Lift Basket (Bimanual)

×2

Move Block to Right Plate (Bimanual)

Assemble/Disassemble

×2

Stack Ring Blocks

×2

Grab Square Blocks

×2

Place Kettle on Base

×2

Remove Pen Cap (Bimanual)

×2

Separate Nested Bowls (Bimanual)

Articulated Object

×2

Open Cabinet Door

×4

Open Laptop (Bimanual)

Extra Tasks

×2

Flip Hourglass

×2

Lift Beer Glass

×2

Place Blocks in Plate (Bimanual)

×2

Place Cake

×2

Place Cakes on Plate (Bimanual)

×2

Place Ice Cream in Bowl

×2

Place Cubes in White Lid (Bimanual)

×2

Remove Eggs (Bimanual)

×2

Stack Blocks (Bimanual)

×2

Unstack Ring Block

×2

Wrap Cup with Tape

Dexterous manipulation sequences

×2

Cut Leek

×2

Use Pen

×2

Roll Dough

×2

Twist Cap

×2

Fetch Book

×2

Place Plates

Out-of-Distribution Generalization

We test OOD robustness on the “Pick apple to the plate” task across six conditions: unseen background, unseen lighting, unseen object, occlusion, clutter, and height change.

×2

Unseen Background

×2

Unseen Lighting

×2

Unseen Object

×2

Occlusion

×2

Clutter

×2

Height Change

Cross-embodiment Generalization

Across three embodiments (single-arm gripper, dual-arm grippers, single-arm dexterous hand), we find that transfer from high-DoF to low-DoF embodiments is straightforward, while transfer from low-DoF to high-DoF embodiments remains challenging.

Single Gripper

×2

Place Ice Cream

×2

Stack Bowls

×2

Pour Water

×2

Stack Blocks

×2

Place Cakes on Plate

Single Hand

×2

Place Ice Cream

×2

Stack Bowls

×2

Pour Water

×2

Stack Blocks

×2

Place Cakes on Plate

Dual Gripper

×2

Place Ice Cream

×2

Stack Bowls

×2

Pour Water

×2

Stack Blocks

×2

Place Cakes on Plate

×2

Pull Drawer

×2

Pass Pepper

Effect of the Data-Quality Discriminator

In the single-hand corn-to-plate task and the bimanual basket-lifting task, quality-aware post-training significantly improves motion stability: with the discriminator, joint trajectories are smoother and tasks succeed, whereas without it, oscillations lead to object drops.

w/o Discriminator

×1

Shaking; Fail

w/ Discriminator

×1

Smooth; Succeed

Joint Curve

×1

Analysis

w/o Discriminator

×1

Shaking; Fail

w/ Discriminator

×1

Smooth; Succeed

Joint Curve

×1

Analysis

Failure Cases

Despite the overall robustness of Dexora-VLA, certain failure cases reveal fundamental challenges in semantic grounding, physical interaction modeling, and cross-arm coordination. These examples highlight open problems in large-scale Vision–Language–Action learning for high-DoF bimanual manipulation.

×2

Semantic Misgrounding due to Instruction Ambiguity

The language-conditioned policy misinterprets the instruction “put the apple into the bowl”, confusing the referent (“small bowl” vs. “big bowl”), and thus selects the incorrect target for placement.

×2

Visual Misclassification under Object Similarity

When executing the instruction “put the red apple into the bowl”, the vision encoder confuses visually similar red objects (apple vs. tomato) under mixed lighting, causing a target identification error.