2026-04-02
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu et al.
problem
End-to-end Vision-Language-Action (VLA) models treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level continuous actions. This paradigm suffers from two fundamental issues:
- Underutilized decision-making: The VLM’s potential for high-level reasoning is squandered when it serves as a passive feature extractor rather than an active decision maker.
- Training instability: Low-level action supervision causes representation collapse — the VLM’s rich semantic features degrade and overfit to spurious action patterns. Methods like π0 (Physical Intelligence) and GR00T-N1 (NVIDIA) truncate gradient flows or freeze the VLM backbone entirely, which prevents the VLM from acquiring action-aware dynamics.
Prior approaches and their limitations:
- Hierarchical Planners (SayCan, Hi-RT-2, Code as Policies, ReKeP, Gemini Robotics 1.5): Generate text subtasks or executable code via an LLM/VLM to guide a separate controller. These create a non-differentiable wall — no action gradients flow back to the foundation model — and incur high deployment latency. Video-generation-based variants (e.g., Du et al. 2023) predict pixel-level goals but suffer from prohibitive inference costs and lack VLM-level semantic knowledge.
- End-to-End VLAs (RT-2, OpenVLA, GR00T-N1, π0, CogAct, GR-3): Directly predict continuous actions. Even with auxiliary world modeling objectives, they lack a strict structural bottleneck — the policy can rely on superficial correlations (shortcut learning) rather than truly translating the VLM’s intent.
- Auxiliary world model approaches (FLARE, SEER, CoT-VLA, UniCoD): FLARE uses additional query tokens for latent regularization but treats foresight as optional context at execution time. SEER concatenates foresight features with VL features but does not enforce a strict causal dependency. UniCoD uses Mixture-of-Transformers but still allows intent bypass. All suffer from loose coupling between predicted future states and the execution policy.
DIAL resolves this by introducing latent visual foresight as a fully differentiable structural bottleneck in a biologically inspired dual-system architecture, ensuring every motor command is strictly grounded in the VLM’s reasoning intent.
architecture
DIAL decomposes VLA into two systems communicating through a latent intent bottleneck:
System-2 (Brain): Predictive Intent Synthesis
System-2 uses a pre-trained VLM backbone (Qwen2.5-VL-3B). Given language instruction $l_t$, current visual observation $o_t$, and $N$ learnable query tokens appended to the LLM input:
- The ViT encoder extracts visual patches from $o_t$ (producing features in $\mathbb{R}^{N \times d}$, where $N$ is the number of ViT patches and matches the number of learnable queries, preserving spatial structure).
- The LLM processes visual patches, language tokens, and learnable queries together.
- The output representations of the query tokens pass through an MLP projection head to synthesize the latent intent $x_t \in \mathbb{R}^{N \times d}$.
This latent intent $x_t$ is explicitly constrained to encode visual foresight — it is trained to predict the ViT features of the observation $o_{t+H}$ at $H$ timesteps ahead:
\[\mathcal{L}\_{\text{world}} = \|x\_t - \text{Enc}\_{\text{ViT}}(o\_{t+H})\|\_2^2\]
Both the foresight target and the current-observation encoder use the identical frozen pre-trained ViT from the VLM backbone, ensuring strict feature-space consistency.
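The System-2 pathway above can be sketched in a few lines. The toy Transformer layer, dimensions, and module names below are illustrative stand-ins for the Qwen2.5-VL-3B stack, not the paper's implementation:

```python
import torch
import torch.nn as nn

class System2Sketch(nn.Module):
    def __init__(self, num_queries=16, d=64):
        super().__init__()
        # N learnable query tokens appended to the LLM input
        self.queries = nn.Parameter(torch.randn(num_queries, d))
        # toy stand-in for the trainable LLM blocks
        self.llm = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        # MLP projection head that synthesizes the latent intent x_t
        self.head = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, vis_patches, lang_tokens):
        # vis_patches: (B, N, d) frozen-ViT features of o_t
        # lang_tokens: (B, L, d) embedded instruction tokens
        q = self.queries.expand(vis_patches.shape[0], -1, -1)
        out = self.llm(torch.cat([vis_patches, lang_tokens, q], dim=1))
        # read out the query positions and project them to the latent intent
        return self.head(out[:, -q.shape[1]:])

def world_loss(x_t, future_vit_feats):
    # L_world = || x_t - Enc_ViT(o_{t+H}) ||_2^2, taken as a mean-squared error
    return ((x_t - future_vit_feats) ** 2).mean()

B, N, L, d = 2, 16, 8, 64
x_t = System2Sketch(num_queries=N, d=d)(torch.randn(B, N, d), torch.randn(B, L, d))
loss = world_loss(x_t, torch.randn(B, N, d))
```

Because the foresight target has the same $(N, d)$ shape as the query readout, the number of queries must equal the number of ViT patches.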
System-1 (Cerebellum): Latent Inverse Dynamics
System-1 operates as a reactive motor controller:
- An independent perceptual pathway using the same frozen ViT extracts features $\text{Enc}_{\text{ViT}}(o_t)$ from the current observation.
- A 4-layer self-attention module fully fuses the current visual features with the predictive intent $x_t$, producing a spatially-aware fused representation.
- This fused representation serves as the cross-attention condition for a 16-layer Diffusion Transformer (DiT).
- The robot’s proprioceptive state $q_t$ is projected via an MLP into a dense feature token and fed directly into the DiT alongside noisy action tokens.
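A minimal sketch of this fusion front-end, with toy layer counts (the paper uses a 4-layer self-attention module and a 16-layer DiT) and the 47-dim proprioceptive state from the training section:

```python
import torch
import torch.nn as nn

class System1FrontEnd(nn.Module):
    def __init__(self, d=64, layers=2, state_dim=47):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=layers)  # self-attention fusion
        self.proprio_mlp = nn.Sequential(nn.Linear(state_dim, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, vit_feats, x_t, q_t):
        # vit_feats, x_t: (B, N, d); q_t: (B, state_dim) robot state
        fused = self.fuse(torch.cat([vit_feats, x_t], dim=1))  # cross-attention condition for the DiT
        proprio_tok = self.proprio_mlp(q_t).unsqueeze(1)       # (B, 1, d) dense token fed into the DiT
        return fused, proprio_tok

fused, tok = System1FrontEnd()(torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 47))
```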
Action generation uses flow matching (optimal transport). Given ground-truth action chunk $A_t = [a_t, a_{t+1}, \ldots, a_{t+H-1}]$ with horizon $H = 16$, time variable $\tau \sim \mathcal{U}[0,1]$, and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, the interpolated path is $A_t^\tau = \tau A_t + (1-\tau)\epsilon$:
\[\mathcal{L}\_{\text{fm}}(\theta) = \mathbb{E}\_{\tau, \epsilon}\left[\|V\_\theta(A\_t^\tau \mid x\_t, \text{Enc}\_{\text{ViT}}(o\_t), q\_t, \tau) - (A\_t - \epsilon)\|\_2^2\right]\]
Key Design Principle
System-1 functions as a latent inverse dynamics model: it must resolve the discrepancy between current visual features and predicted latent foresight to generate actions. Unlike traditional inverse dynamics that operate on raw pixels, DIAL resolves state-transition dynamics entirely within a structured latent space. This imposes a hard bottleneck — the policy cannot bypass System-2’s intent.
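The flow-matching objective follows directly from the definitions above; `velocity_net` here is a hypothetical stand-in for the conditioned DiT:

```python
import torch

def flow_matching_loss(velocity_net, A_t, cond):
    # A_t: (B, H, act_dim) ground-truth action chunk
    tau = torch.rand(A_t.shape[0], 1, 1)      # tau ~ U[0, 1], one per sample
    eps = torch.randn_like(A_t)               # eps ~ N(0, I)
    A_tau = tau * A_t + (1 - tau) * eps       # interpolated path A_t^tau
    v_pred = velocity_net(A_tau, tau, cond)   # predicted velocity field
    return ((v_pred - (A_t - eps)) ** 2).mean()  # regress the OT velocity target

# sanity check with a zero "network": the loss reduces to E[(A_t - eps)^2]
loss = flow_matching_loss(lambda a, t, c: torch.zeros_like(a),
                          torch.randn(2, 16, 47), cond=None)
```

Note that the regression target $A_t - \epsilon$ is exactly the derivative of the interpolated path with respect to $\tau$.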
Overall Policy
\[x\_t = f\_{\text{System-2}}(l\_t, o\_t), \quad A\_t \sim \pi\_{\text{System-1}}(\cdot \mid x\_t, o\_t, q\_t)\]
Parameter Counts
- VLM backbone (Qwen2.5-VL-3B): ~3B parameters (frozen ViT + frozen text embeddings; LLM blocks, learnable queries, and MLP head are trainable)
- System-1: 4-layer self-attention + 16-layer DiT + proprio MLP (lightweight relative to VLM)
- Total trainable parameters are dominated by the Qwen2.5 LLM blocks (~2.8B trainable)
flowchart LR
lt["language instruction"] --> vlm["VLM Qwen2.5-VL-3B"]
ot1["current observation"] --> vit1["Frozen ViT"]
vit1 --> vlm
vlm --> mlp["MLP Projection"]
lq["learnable queries"] --> vlm
mlp --> xt["latent intent x_t"]
ot2["current observation"] --> vit2["Shared Frozen ViT"]
vit2 --> sa["4-Layer Self-Attention"]
xt --> sa
sa --> dit["16-Layer DiT"]
qt["proprioceptive state q_t"] --> mlp2["MLP"]
mlp2 --> dit
at["action chunk A_t"]
dit --> at
otH["future observation o_t+H"] --> vit3["Shared Frozen ViT"]
vit3 --> loss["L_world MSE Loss"]
xt --> loss
style vlm fill:#8B7355
style xt fill:#A0522D
style dit fill:#6B8E23
style at fill:#556B2F
training
Two-Stage Paradigm
Stage 1: Decoupled Warmup — System-2 and System-1 train independently:
- System-2: optimized solely via $\mathcal{L}_{\text{world}}$ to master physically-grounded visual foresight. Uses action-free data.
- System-1: trained via $\mathcal{L}_{\text{fm}}$ with $x_t$ replaced by ground-truth future features $\text{Enc}_{\text{ViT}}(o_{t+H})$. Learns sensorimotor control under perfect future guidance.
Stage 2: End-to-End Training — Full pipeline unified:
- System-1 now conditioned on synthesized $x_t$ from System-2.
- Action gradients backpropagate through $x_t$ into trainable VLM parameters.
- The foresight reconstruction loss ($\mathcal{L}_{\text{world}}$) regularizes the VLM, preventing representation collapse.
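The schedule above can be captured in a runnable toy: tiny linear layers stand in for both systems, and only the control flow (which loss updates what in which stage) mirrors the paper.

```python
import torch
import torch.nn as nn

d = 8
system2 = nn.Linear(d, d)      # "VLM": current features -> latent intent x_t
system1 = nn.Linear(2 * d, d)  # "DiT": (intent, current) -> velocity prediction
vit = lambda o: o              # frozen "ViT": identity on toy feature vectors

def fm_loss(intent, cur, target_vel):
    return ((system1(torch.cat([intent, cur], dim=-1)) - target_vel) ** 2).mean()

def make_batch():
    # o_t, o_{t+H}, velocity target
    return torch.randn(4, d), torch.randn(4, d), torch.randn(4, d)

opt2 = torch.optim.SGD(system2.parameters(), lr=1e-2)
opt1 = torch.optim.SGD(system1.parameters(), lr=1e-2)

# Stage 1: decoupled warmup -- the two systems never see each other.
for _ in range(10):
    o_t, o_fut, v = make_batch()
    l_world = ((system2(vit(o_t)) - vit(o_fut)) ** 2).mean()  # foresight only
    opt2.zero_grad(); l_world.backward(); opt2.step()
    l_fm = fm_loss(vit(o_fut), vit(o_t), v)  # ground-truth future guidance
    opt1.zero_grad(); l_fm.backward(); opt1.step()

# Stage 2: end-to-end -- System-1 conditions on the synthesized x_t, and
# both losses update both systems through the differentiable bottleneck.
opt = torch.optim.SGD(list(system2.parameters()) + list(system1.parameters()), lr=1e-2)
for _ in range(10):
    o_t, o_fut, v = make_batch()
    x_t = system2(vit(o_t))
    loss = fm_loss(x_t, vit(o_t), v) + ((x_t - vit(o_fut)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```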
Frozen / Trainable Split
- Frozen: ViT encoder, text embedding layer of VLM
- Trainable: LLM blocks, learnable query tokens, MLP projection head, entire System-1 (self-attention + DiT + proprio MLP)
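The split can be applied with standard PyTorch parameter freezing; the module names below (`vit`, `text_embed`, ...) are hypothetical handles, not the repository's actual attribute names:

```python
import torch.nn as nn

class DummyDIAL(nn.Module):
    def __init__(self):
        super().__init__()
        self.vit = nn.Linear(4, 4)             # frozen ViT stand-in
        self.text_embed = nn.Embedding(10, 4)  # frozen text embeddings
        self.llm = nn.Linear(4, 4)             # trainable LLM blocks stand-in
        self.system1 = nn.Linear(4, 4)         # trainable action expert stand-in

def apply_dial_split(model):
    # everything trainable by default...
    for p in model.parameters():
        p.requires_grad_(True)
    # ...then freeze the ViT encoder and the text embedding layer
    for frozen in (model.vit, model.text_embed):
        for p in frozen.parameters():
            p.requires_grad_(False)
    return [p for p in model.parameters() if p.requires_grad]

trainable = apply_dial_split(DummyDIAL())  # llm weight/bias + system1 weight/bias
```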
Simulation Training (RoboCasa GR1 Tabletop)
- Full Data regime: 24,000 trajectories (1,000/task), 160,000 steps total (80k warmup + 80k end-to-end)
- Few-Shot regime: 2,400 trajectories (100/task), 40,000 steps total (20k warmup + 20k end-to-end)
- Human data pre-training (few-shot + human): 27,419 EgoDex trajectories + 2,400 robot trajectories, 40k steps pre-training (20k warmup + 20k end-to-end) + 20k steps end-to-end fine-tuning on robot-only data
- Robot state/action: 47-dimensional (29 joint DoF: 14 dual arms + 12 hands + 3 waist; 18 EEF pose dims)
- Action chunk horizon: $H = 16$
Real-World Training (IRON-R01-1.11 Humanoid)
- State/action space: 50-dimensional (extends simulation with 3-DoF head)
- 120 robot trajectories per task (laboratory collected)
- Pre-training: 160,000 steps on mixed dataset (32k proprietary factory robot trajectories + 30k EgoDex trajectories), split 80k decoupled warmup + 80k end-to-end
- Fine-tuning: 2,000 steps task-specific end-to-end
Training Infrastructure
The paper does not report GPU count or wall-clock training time; the 3B VLM backbone, DiT-based System-1, and 160k-step schedules suggest multi-GPU training (plausibly 4–8 A100-class GPUs).
evaluation
RoboCasa GR1 Tabletop — Full Data (1,000 trajectories/task, 160k steps)
Results on 24 tasks (18 Pick & Place + 6 Articulated), 50 episodes each:
| Method | Pick & Place | Articulated | 24-Task Avg |
|---|---|---|---|
| Diffusion Policy | ~30 | ~24 | ~27 |
| UWM | ~40 | ~38 | ~39 |
| GR00T-N1.6 | ~44 | ~51 | 47.6 |
| FLARE | ~47 | ~64 | 55.0 |
| GR00T-Qwen3 | ~40 | ~50 | 43.7 |
| π0-Qwen3 | ~50 | ~48 | 48.8 |
| FAST-Qwen3 | ~43 | ~42 | 42.3 |
| OFT-Qwen3 | ~46 | ~50 | 47.8 |
| DIAL | 68.9 | 74.3 | 70.2 |
DIAL achieves 70.2% average, a +15.2 point margin over the next-best (FLARE at 55.0%).
RoboCasa GR1 Tabletop — Few-Shot (100 trajectories/task, 40k steps)
| Method | Pick & Place | Articulated | 24-Task Avg |
|---|---|---|---|
| GR00T-Qwen2.5 (frozen) | — | — | 21.8 |
| GR00T-Qwen2.5-FT | — | — | 30.6 |
| GR00T-Qwen2.5+FLARE | — | — | 51.9 |
| GR00T-Qwen2.5+SEER | — | — | 49.6 |
| GR00T-Qwen2.5+SEER-EV | — | — | 47.2 |
| DIAL-DINO | — | — | 47.2 |
| DIAL | 56.0 | 64.7 | 58.3 |
Key finding: DIAL at 58.3% (100 trajectories/task) surpasses FLARE at 55.0% trained with 10× more data (1,000 trajectories/task).
Scalability with Human Data (Few-Shot + EgoDex)
| Setting | In-Dist Pick & Place | In-Dist Articulated | In-Dist Avg | OOD Unseen Appearance | OOD Unseen Combos | OOD Unseen Objects | OOD Avg |
|---|---|---|---|---|---|---|---|
| DIAL w/o Human Data | 56.0 | 65.3 | 60.8 | 50.7 | 53.0 | 34.8 | 46.2 |
| DIAL + Human Data | 60.8 | 62.0 | 61.1 | 53.8 | 58.7 | 41.1 | 51.2 |
Human data boosts OOD average from 46.2% to 51.2%. Articulated tasks show no improvement due to domain mismatch (EgoDex lacks articulated demonstrations).
Real-World (IRON-R01-1.11) — In-Distribution
| Method | Pick & Place | Pour | Avg |
|---|---|---|---|
| GR00T-Qwen2.5 | ~30 | ~17 | ~25 |
| GR00T-Qwen2.5+FLARE | ~40 | ~30 | ~35 |
| DIAL w/o Human Data | ~50 | ~40 | ~45 |
| DIAL w/o Decoupled Warmup | ~55 | ~55 | ~57.5 |
| DIAL | 80 | 70 | 77.5 |
Removing decoupled warmup causes in-distribution performance to drop from 77.5% to 57.5%.
Real-World — Out-of-Distribution
| Method | Combinatorial | Distractor | Instance-Level | Avg |
|---|---|---|---|---|
| GR00T-Qwen2.5 | ~10 | ~10 | ~10 | ~10 |
| GR00T-Qwen2.5+FLARE | ~20 | ~20 | ~5 | ~17.5 |
| DIAL w/o Human Data | ~20 | ~20 | ~30 | ~26.7 |
| DIAL w/o Decoupled Warmup | ~30 | ~30 | ~30 | ~30.0 |
| DIAL | 60 | 40 | 60 | 58.3 |
Where DIAL Wins
- Data efficiency: 10× fewer demonstrations to match or exceed prior methods
- Structured grounding: Strict bottleneck prevents shortcut learning
- OOD generalization: Strong zero-shot transfer to unseen objects, textures, combinations
- Cross-embodiment scaling: Effectively absorbs knowledge from human demonstrations
- Articulated tasks: 74.3% vs 55.0% (FLARE) in full-data setting
Where DIAL is Weaker
- Articulated tasks with human data: No improvement (62.0% vs 65.3% without) when human data lacks relevant demonstrations
- Latent space dependency: Performance drops significantly with DINO-v2 features (47.2% vs 58.3%) — requires native VLM feature consistency
reproduction guide
Environment Setup
# Clone the repository (not yet released — check project page)
git clone https://github.com/xpeng-robotics/dial.git
cd dial
# Create conda environment
conda create -n dial python=3.10 -y
conda activate dial
# Install PyTorch (CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
# Expected key deps: transformers, diffusers, flow-matching, robosuite, robocasa
Data Preparation
Simulation (RoboCasa GR1 Tabletop):
# Install RoboCasa
pip install robocasa
# Download GR1 tabletop dataset — 24 tasks, 1000 trajectories each (full) or 100 each (few-shot)
# Follow RoboCasa docs for dataset download
python tools/download_robocasa_data.py --tasks all --split full
# or for few-shot:
python tools/download_robocasa_data.py --tasks all --split fewshot --traj_per_task 100
Human Data (EgoDex):
# Download EgoDex basic_pick_place subset (27,419 trajectories)
# and pour subset (3,205 trajectories)
# Extract wrist EEF poses, pad to match robot state dimension
python tools/process_egodex.py --subset basic_pick_place --output_dir data/egodex
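A hedged sketch of the padding step: extracted human wrist EEF poses are zero-padded to the robot's 47-dim state vector. The 18-dim dual-wrist layout used below is an assumption for illustration (mirroring the robot's 18 EEF pose dims), not the exact EgoDex schema.

```python
import numpy as np

ROBOT_STATE_DIM = 47  # simulation state/action dimension from the training section

def pad_human_state(wrist_poses: np.ndarray) -> np.ndarray:
    # wrist_poses: (T, D) extracted human wrist EEF poses with D < 47;
    # the remaining robot dimensions are zero-filled
    T, D = wrist_poses.shape
    out = np.zeros((T, ROBOT_STATE_DIM), dtype=wrist_poses.dtype)
    out[:, :D] = wrist_poses
    return out

padded = pad_human_state(np.ones((5, 18)))  # hypothetical 18-dim wrist-pose layout
```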
Training — Simulation Few-Shot
# Stage 1: Decoupled Warmup (20,000 steps)
python train.py \
--config configs/dial_robocasa_fewshot.yaml \
--stage warmup \
--num_steps 20000 \
--batch_size 256 \
--lr 1e-4 \
--data_dir data/robocasa_fewshot
# Stage 2: End-to-End Training (20,000 steps)
python train.py \
--config configs/dial_robocasa_fewshot.yaml \
--stage endtoend \
--num_steps 20000 \
--batch_size 256 \
--lr 1e-4 \
--warmup_ckpt checkpoints/warmup/latest.pt \
--data_dir data/robocasa_fewshot
Training — Few-Shot + Human Data
# Pre-training: 40k steps (20k warmup + 20k end-to-end) on mixed human + robot data
python train.py \
--config configs/dial_robocasa_human.yaml \
--stage pretrain \
--num_steps 40000 \
--human_data_dir data/egodex/basic_pick_place \
--robot_data_dir data/robocasa_fewshot
# Fine-tuning: 20k steps end-to-end on robot-only data
python train.py \
--config configs/dial_robocasa_human.yaml \
--stage finetune \
--num_steps 20000 \
--pretrain_ckpt checkpoints/pretrain/latest.pt \
--robot_data_dir data/robocasa_fewshot
Evaluation
# Evaluate on RoboCasa GR1 Tabletop (24 tasks, 50 episodes each)
python eval.py \
--config configs/dial_robocasa_fewshot.yaml \
--ckpt checkpoints/endtoend/latest.pt \
--num_episodes 50 \
--tasks all
# Expected output: ~58.3% average success rate (few-shot), ~70.2% (full data)
Verification Checklist
- Confirm System-2 warmup produces valid latent foresight by computing MSE against ground-truth ViT features — should decrease steadily.
- Verify System-1 with ground-truth future guidance achieves stable action prediction during warmup.
- After end-to-end training, check that $\mathcal{L}_{\text{fm}}$ gradients propagate through $x_t$ to VLM parameters.
- Run PCA visualization of predicted foresight vs ground-truth future vs current observation to confirm semantic alignment (see Figure 12 in paper).
- Compare against frozen VLM baseline — should see >2× improvement in success rate.
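The gradient-flow check in the list above can be scripted directly; the two linear layers are toy stand-ins for the trainable VLM parameters and System-1:

```python
import torch
import torch.nn as nn

vlm = nn.Linear(8, 8)     # stand-in for the trainable LLM blocks + MLP head
policy = nn.Linear(8, 4)  # stand-in for System-1

x_t = vlm(torch.randn(2, 8))             # latent intent (the bottleneck)
action_loss = policy(x_t).pow(2).mean()  # surrogate for L_fm
action_loss.backward()

grad_norm = vlm.weight.grad.norm().item()  # nonzero => gradients crossed the bottleneck
```

If `grad_norm` is zero (or `grad` is `None`) on the real model, the bottleneck has been accidentally detached, e.g. by a stray `.detach()` or a frozen parameter group.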
Hardware Requirements
- Minimum (estimated; the paper does not report hardware): 4× A100 80GB GPUs (for 3B VLM + DiT)
- Recommended: 8× A100 80GB for full-data regime with larger batch sizes
- Storage: ~500GB for RoboCasa full dataset + EgoDex
notes
Key Takeaways
- Structural bottleneck > auxiliary loss: The core insight is that merely adding a world modeling objective (like FLARE) is insufficient — the architecture must structurally force the policy to depend on the VLM’s intent. DIAL’s inverse dynamics formulation achieves this.
- Latent-space consistency matters: The DINO-v2 ablation (58.3% → 47.2%) shows that the latent intent must live in the VLM’s native feature space. Cross-manifold translation destroys the semantic-physical alignment.
- Decoupled warmup is critical: Removing it causes a 20-point in-distribution drop (77.5% → 57.5%) and a 28.3-point OOD drop (58.3% → 30.0%) in the real-world evaluation. The warmup prevents posterior collapse by letting System-1 learn under perfect guidance before encountering noisy predictions.
- Action-aware foresight: The end-to-end gradients transform $x_t$ from pure visual prediction into a task-oriented representation optimized for downstream motor execution — this is DIAL’s unique mechanism.
- Cross-embodiment scaling works: 27k human egocentric trajectories meaningfully improve zero-shot generalization to novel objects (+5 OOD points), validating the action-free world modeling pre-training paradigm.
Connections to Other Work
- FLARE (Zheng et al. 2025): Closest predecessor — also uses flow matching with future latent alignment. DIAL differs by enforcing a structural bottleneck (inverse dynamics) rather than treating foresight as optional context. FLARE achieves 55.0% vs DIAL’s 70.2% on full-data RoboCasa.
- SEER (Tian et al. 2024): Predictive inverse dynamics model that concatenates foresight features. DIAL improves on this by operating entirely within the VLM’s native latent space and using a decoupled warmup.
- π0 (Black et al. 2024): Dual-system VLA with flow matching DiT. π0 freezes the VLM entirely; DIAL makes the LLM blocks trainable through controlled gradient flow, achieving better intent-action alignment.
- GR00T-N1.6 (NVIDIA): Upgraded GR00T with larger DiT and fine-tuned late VLM layers. DIAL’s 70.2% vs GR00T-N1.6’s 47.6% on RoboCasa highlights the value of explicit world modeling.
- CoT-VLA (Zhao et al. 2025): Uses visual chain-of-thought for reasoning. Requires costly annotations and introduces inference latency. DIAL achieves implicit reasoning through latent foresight without annotation overhead.
- UniCoD (Zhang et al. 2025): Mixture-of-Transformers for unified world prediction and action. DIAL’s explicit System-1/System-2 split provides cleaner decoupling.
- Latent Action Pretraining (Ye et al. 2025, ICLR): Learns discrete latent actions from video. DIAL operates in continuous latent space with explicit intent-action grounding.
- WorldVLA (Cen et al. 2025): Autoregressive action world model. DIAL avoids autoregressive tokenization overhead by using flow matching in continuous latent space.
- Biological inspiration: The System-2/System-1 framing directly parallels Kahneman’s dual-process theory — deliberative reasoning vs. fast automatic responses. This cognitive grounding provides an intuitive explanation for the architecture’s effectiveness.
Future Directions (from paper)
- Scaling System-1 DiT to larger parameter sizes
- End-to-end fine-tuning of the ViT backbone (stabilized via EMA-based encoding and latent token compression)
- Pre-training on massive action-free human videos (YouTube-scale) to build truly generalist embodied agents
- Integrating latent world modeling into native VLM pre-training objectives
- Modular iteration: pre-train System-1 once, swap in new VLM generations without retraining the action expert