2026-04-02

CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

Andrew Jeong, Jaemin Kim, Sebin Lee, Sung-Eui Yoon

robotics manipulation-planning latent-dynamics cross-modal-learning diffusion-policy

Problem

CLaD tackles the problem of long-horizon robotic manipulation planning where a robot must execute multi-step sequential tasks (2–3 subtasks per episode) involving coupled kinematic transitions (robot joint/velocity changes) and semantic transitions (visual scene changes). The core insight is that these two transition types are causally coupled through underlying actions — when a robot closes its gripper (kinematic), the visual scene updates (semantic) — and prior methods fail to explicitly model this coupling during planning.

Prior Art and Limitations

Semantic artifact generation approaches produce explicit visual or textual subgoals during planning but suffer from computational overhead:

  • SayCan [Ahn et al., 2022] grounds language in affordances for feasible action selection, but is limited to skill-level selection without fine-grained control.
  • Chain-of-thought methods [Zawalski et al., 2025; CoT-VLA, Zhao et al., 2025] generate explicit reasoning steps to decompose tasks, introducing expensive iterative text generation.
  • DiffusionVLA [Wen et al., 2025] combines autoregressive textual reasoning with diffusion-based action generation, but the iterative textual reasoning creates computational bottlenecks.
  • SuSIE [Black et al., 2024] uses an image-editing diffusion model to propose visual subgoals for a low-level policy (76.3% on LIBERO-LONG, 0.86B params), but image generation is computationally expensive and struggles with complex multi-step tasks.
  • Seer [Tian et al., 2025] predicts robot actions via inverse dynamics conditioned on forecasted future visual states (87.7%, 0.32B params), but video prediction-based methods are inherently noisy.
  • Video generation methods [Du et al., 2023; Wu et al., 2024] use text-conditioned video generation to guide low-level policies, but face significant computational cost from iterative generation.

Latent-space planning approaches are more efficient but lack cross-modal constraints:

  • TD-MPC2 [Hansen et al., 2024] performs model-predictive control in a decoder-free latent space, but implicitly conflates semantic and kinematic information without ensuring cross-modal consistency during rollouts.
  • DreamerV3 [Hafner et al., 2025] learns world models in recurrent state-space models, but focuses on single-step dynamics without cross-modal alignment.
  • LBP [Liu et al., 2025] performs latent space backward planning from goal to current state (88.6%, 0.19B params), but shows pronounced degradation on tasks requiring perceptually similar object discrimination (Task 9: 60.0%).
  • UVA [Li et al., 2025] operates in a unified video-action latent space (0.5B params), but similarly lacks explicit cross-modal transition modeling.

Cross-modal representation methods align modalities at individual timesteps, not transitions:

  • R3M [Nair et al., 2022] and DecisionNCE [Li et al., 2024] learn vision-language correspondences via contrastive objectives at the observation level.
  • RoboMimic/MAE [Radosavovic et al., 2023] and RPT [Wang et al., 2024] jointly embed robot kinematic states and visual observations via heterogeneous transformers, but model cross-modal correspondences at single timesteps only, without capturing how semantic and kinematic states jointly change under actions.

Key gap: No existing method explicitly models how proprioceptive and semantic transitions co-evolve under actions, leading to latent representations that can decouple during rollout and produce physically or logically inconsistent trajectories.

Architecture

CLaD is a two-stage framework: Stage 1 learns cross-modal latent dynamics and predicts grounded latent foresights; Stage 2 conditions a diffusion policy on these foresights for action generation. Total parameters: 0.66B (VLM: 0.1B, CLaD dynamics: 0.33B, Policy: 0.23B).

Stage 1: Cross-Modal Latent Dynamics and Grounded Foresight

State Encoding. At each timestep $t$, the proprioceptive state $p_t \in \mathbb{R}^{D_p}$ (joint angles and velocities) and semantic state $s_t = \text{FiLM}(v_t, l) \in \mathbb{R}^{D_s}$ are encoded via MLP encoders with hidden dimension $H = 1024$:

\[\bar{p}\_t = f\_p(p\_t) \in \mathbb{R}^{N\_p \times H}, \quad \bar{s}\_t = f\_s(s\_t) \in \mathbb{R}^{N\_s \times H}\]

where $f_p: \mathbb{R}^{D_p} \to \mathbb{R}^{N_p \times H}$ and $f_s: \mathbb{R}^{D_s} \to \mathbb{R}^{N_s \times H}$ are MLPs whose outputs are tokenized into sequences of length $N_p = N_s = 4$. The semantic state uses a FiLM layer [Perez et al., 2018] to fuse vision-language embeddings $v_t, l$ from a frozen DecisionNCE VLM [Li et al., 2024].
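As a concrete (unofficial) sketch of the encoding step, the numpy snippet below tokenizes a proprioceptive vector into $N_p$ tokens of width $H$ and FiLM-fuses a vision-language pair. All weights are random stand-ins, and $D_p$, $D_s$ are illustrative sizes not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; D_p and D_s are placeholders, H and N_p follow the paper.
D_p, D_s, H, N_p = 14, 512, 1024, 4

def mlp_encoder(x, W1, W2, n_tokens, h):
    """Two-layer ReLU MLP whose output is split into n_tokens tokens of width h."""
    y = np.maximum(x @ W1, 0.0) @ W2      # (n_tokens * h,)
    return y.reshape(n_tokens, h)         # tokenize into a short sequence

def film(v, l, W_gamma, W_beta):
    """FiLM fusion: the language embedding l scales and shifts visual features v."""
    return (l @ W_gamma) * v + (l @ W_beta)

# Proprioceptive branch: p_t -> N_p tokens of width H.
p_t = rng.normal(size=D_p)
W1 = rng.normal(0, 0.02, (D_p, H))
W2 = rng.normal(0, 0.02, (H, N_p * H))
p_bar = mlp_encoder(p_t, W1, W2, N_p, H)

# Semantic branch: FiLM-fused vision-language embedding (frozen VLM assumed upstream).
v_t, l = rng.normal(size=D_s), rng.normal(size=D_s)
Wg, Wb = rng.normal(0, 0.02, (D_s, D_s)), rng.normal(0, 0.02, (D_s, D_s))
s_t = film(v_t, l, Wg, Wb)
assert p_bar.shape == (N_p, H) and s_t.shape == (D_s,)
```

The real model would learn these weights end-to-end and tokenize $s_t$ the same way; the sketch only fixes the shapes.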

Transition Embedding. To extract how each modality changes over the action horizon $\tau = 6$, modality-specific cross-attention modules produce transition representations by conditioning current states on past states and action sequences:

\[z^t\_p = \text{CrossAttn}(\bar{p}\_t, [\bar{p}\_{t-\tau}; a\_{t-\tau:t}]) \in \mathbb{R}^{N\_p \times H}\] \[z^t\_s = \text{CrossAttn}(\bar{s}\_t, [\bar{s}\_{t-\tau}; a\_{t-\tau:t}]) \in \mathbb{R}^{N\_s \times H}\]

where $[\,;\,]$ denotes concatenation over token dimension and $a_{t-\tau:t} = f_a(a_{t-\tau:t})$ is the encoded action sequence. During training, action tokens are stochastically replaced with a learnable token (masking ratio $r = 0.3$), encouraging the model to infer transitions from state differences for robustness.
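The transition embedding and stochastic action masking can be sketched as follows. This is a minimal single-head illustration with no learned projections and a shrunken $H$, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
H, N_p, tau, r = 64, 4, 6, 0.3  # H shrunk for the sketch; r is the mask ratio

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q_tokens, kv_tokens):
    """Single-head scaled dot-product cross-attention (projections omitted)."""
    att = softmax(q_tokens @ kv_tokens.T / np.sqrt(q_tokens.shape[-1]))
    return att @ kv_tokens

p_now, p_past = rng.normal(size=(N_p, H)), rng.normal(size=(N_p, H))
a_tokens = rng.normal(size=(tau, H))       # encoded action sequence f_a(a)
mask_token = rng.normal(size=H)            # learnable in the real model
keep = rng.random(tau) >= r                # stochastic action masking
a_masked = np.where(keep[:, None], a_tokens, mask_token)

# Current states query [past states ; (masked) actions], as in the z_p equation.
z_p = cross_attn(p_now, np.concatenate([p_past, a_masked], axis=0))
assert z_p.shape == (N_p, H)
```

Masked positions all share one token, so the model cannot read the action there and must infer the transition from the state pair.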

Asymmetric Cross-Modal Dynamics. The core of CLaD: proprioceptive transitions query semantic transitions via asymmetric cross-attention:

\[z\_{p \to s} = \text{CrossAttn}(z\_p, z\_s) \in \mathbb{R}^{N\_p \times H}\]

This means kinematic transitions (e.g., “my arm is extending”) attend over semantic transitions (e.g., “the scene is changing”) to interpret visual scene changes through the robot’s kinematic context. Ablations confirm this direction (94.7%) outperforms the reverse (93.8%) and symmetric self-attention (86.7%).

Learnable Pooling to Dynamics Vector. The cross-modal output is pooled into a compact representation:

\[z\_{\text{dyn}} = \text{Pool}(q\_{\text{out}}, z\_{p \to s}) \in \mathbb{R}^H\]

where $\text{Pool}$ is a learnable latent-query mechanism [Jaegle et al., 2021]: a learnable query $q_{\text{out}} \in \mathbb{R}^H$ attends over the input tokens, encouraging the model to extract salient cross-modal dynamics patterns rather than relying on mean/max pooling.
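A toy sketch of the asymmetric direction and the Perceiver-style pooling, reusing single-head attention without learned projections (shrunken $H$; an illustration, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
H, N_p, N_s = 64, 4, 4  # H shrunk for the sketch

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q, kv):
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

z_p, z_s = rng.normal(size=(N_p, H)), rng.normal(size=(N_s, H))

# Asymmetric direction: proprioceptive transitions query semantic transitions.
z_ps = cross_attn(z_p, z_s)            # (N_p, H)

# Perceiver-style pooling: one learnable query attends over the fused tokens.
q_out = rng.normal(size=(1, H))        # learnable in the real model
z_dyn = cross_attn(q_out, z_ps)[0]     # compact dynamics vector, (H,)
assert z_dyn.shape == (H,)
```

Swapping the arguments of the first `cross_attn` call would give the reverse direction that the ablation shows is weaker.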

Latent Foresight Prediction. Lightweight MLP decoders $g_p: \mathbb{R}^H \to \mathbb{R}^H$ and $g_s: \mathbb{R}^H \to \mathbb{R}^H$ predict future latent states from $z_{\text{dyn}}$:

\[\hat{z}^{t+\tau}\_p = g\_p(z\_{\text{dyn}}) \in \mathbb{R}^H\] \[\hat{z}^{t+\tau}\_s = g\_s(z\_{\text{dyn}}) \in \mathbb{R}^H\] \[\hat{z}\_{t+\tau} = [\hat{z}^{t+\tau}\_p\, ;\, \hat{z}^{t+\tau}\_s]\]

EMA Target Encoders (momentum $m = 0.995$) provide stable targets to prevent representation collapse:

\[\theta\_{\text{target}} \leftarrow m \cdot \theta\_{\text{target}} + (1 - m) \cdot \theta\] \[\bar{z}^{t+\tau}\_p = f^{\text{target}}\_p(p\_{t+\tau}), \quad \bar{z}^{t+\tau}\_s = f^{\text{target}}\_s(s\_{t+\tau})\]
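The EMA update itself is a one-liner per parameter; a toy numpy sketch with a hypothetical single-tensor "model":

```python
import numpy as np

m = 0.995  # EMA momentum from the paper

def ema_update(theta_target, theta_online, m):
    """Move each target parameter a (1 - m) fraction toward its online counterpart."""
    return {k: m * theta_target[k] + (1 - m) * theta_online[k] for k in theta_target}

theta_online = {"W": np.ones((2, 2))}   # stand-in online encoder weights
theta_target = {"W": np.zeros((2, 2))}  # in practice the target starts as a copy
for _ in range(3):
    theta_target = ema_update(theta_target, theta_online, m)

# Toward fixed online weights, k updates move the target 1 - m**k of the way.
assert np.allclose(theta_target["W"], (1 - m**3) * np.ones((2, 2)))
```

With $m = 0.995$ the target lags the online encoder by hundreds of steps, which is what keeps the prediction targets stable.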

Loss Functions:

  1. Latent prediction loss — MSE on L2-normalized embeddings, constraining to a unit hypersphere:
\[\mathcal{L}\_{\text{latent}} = \left\| \frac{\hat{z}^{t+\tau}\_p}{\|\hat{z}^{t+\tau}\_p\|\_2} - \frac{\bar{z}^{t+\tau}\_p}{\|\bar{z}^{t+\tau}\_p\|\_2} \right\|^2 + \left\| \frac{\hat{z}^{t+\tau}\_s}{\|\hat{z}^{t+\tau}\_s\|\_2} - \frac{\bar{z}^{t+\tau}\_s}{\|\bar{z}^{t+\tau}\_s\|\_2} \right\|^2\]
  2. Auxiliary reconstruction loss — L1 losses with lightweight decoders $h_p: \mathbb{R}^H \to \mathbb{R}^{D_p}$, $h_s: \mathbb{R}^H \to \mathbb{R}^{D_s}$ anchor latent representations to observable quantities:
\[\mathcal{L}\_{\text{recon}} = \|h\_p(\hat{z}^{t+\tau}\_p) - p\_{t+\tau}\|\_1 + \|h\_s(\hat{z}^{t+\tau}\_s) - s^v\_{t+\tau}\|\_1\]
  3. Combined Stage 1 loss:
\[\mathcal{L} = \mathcal{L}\_{\text{latent}} + \lambda\_{\text{recon}} \mathcal{L}\_{\text{recon}}, \quad \lambda\_{\text{recon}} = 0.1\]
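The three objectives can be sketched directly in numpy. Shapes are shrunk, the decoder $h_p$ is a random linear stand-in, and the semantic reconstruction term is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
H, D_p, lam = 64, 14, 0.1  # illustrative sizes; lambda_recon = 0.1 per the paper

def l2_normalize(z, eps=1e-8):
    """Project onto the unit hypersphere, as in the latent loss."""
    return z / (np.linalg.norm(z) + eps)

z_hat_p, z_bar_p = rng.normal(size=H), rng.normal(size=H)  # prediction / EMA target
z_hat_s, z_bar_s = rng.normal(size=H), rng.normal(size=H)

# Latent prediction loss: MSE between L2-normalized predictions and targets.
l_latent = (np.sum((l2_normalize(z_hat_p) - l2_normalize(z_bar_p)) ** 2)
            + np.sum((l2_normalize(z_hat_s) - l2_normalize(z_bar_s)) ** 2))

# Auxiliary reconstruction loss: L1 back to the observable proprioceptive state.
W_hp = rng.normal(0, 0.02, (H, D_p))   # stand-in for decoder h_p
p_next = rng.normal(size=D_p)
l_recon = np.sum(np.abs(z_hat_p @ W_hp - p_next))

loss = l_latent + lam * l_recon
assert loss >= 0.0
```

Because both operands of the latent loss are unit vectors, each squared-distance term is bounded by 4, which bounds the gradient scale independently of embedding magnitude.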

The reconstruction loss is not merely a regularizer — removing it drops performance from 94.7% to 86.1% (–8.6%). UMAP visualization shows that without $\mathcal{L}_{\text{recon}}$, task-specific clusters in $z_{\text{dyn}}$ become diffuse and overlapping.

Stage 2: Foresight-Conditioned Diffusion Policy

Given predicted foresights $\hat{z}^{t+\tau}_p$, $\hat{z}^{t+\tau}_s$ from the frozen CLaD module, current observations are encoded via modality-specific encoders $e_p$, $e_s$:

\[o^p\_t = e\_p(p\_t), \quad o^s\_t = e\_s(s^v\_t, s^l\_t)\]

Each foresight is FiLM-modulated with its corresponding current observation, where observations supply affine scale and shift parameters:

\[g\_p = \text{FiLM}(\hat{z}^{t+\tau}\_p, o^p\_t), \quad g\_s = \text{FiLM}(\hat{z}^{t+\tau}\_s, o^s\_t)\]
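A minimal sketch of this FiLM modulation, with random stand-in projection matrices and a shrunken $H$ (the real conditioning networks are learned):

```python
import numpy as np

rng = np.random.default_rng(3)
H = 64  # shrunk for the sketch

def film(z, o, W_gamma, W_beta):
    """Observation o supplies the affine scale/shift that modulates foresight z."""
    gamma, beta = o @ W_gamma, o @ W_beta
    return gamma * z + beta

z_p_hat, o_p = rng.normal(size=H), rng.normal(size=H)   # foresight / observation
W_gamma = rng.normal(0, 0.02, (H, H))
W_beta = rng.normal(0, 0.02, (H, H))
g_p = film(z_p_hat, o_p, W_gamma, W_beta)
assert g_p.shape == (H,)
```

The same pattern is applied per modality, so the policy sees each predicted future state re-expressed relative to what the robot currently observes.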

The policy uses standard DDPM noise prediction, trained to denoise an action chunk $a_0$ conditioned on observations and foresights:

\[\mathcal{L}\_{\text{policy}} = \mathbb{E}_{a\_0, k, \epsilon}\left[ \left\| \epsilon - \hat{\epsilon}\_\theta(a\_k, k, g\_p, g\_s) \right\|^2 \right]\]

where $a_k = \sqrt{\bar{\alpha}_k}\, a_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$ is the noised action at diffusion step $k$ and $\epsilon \sim \mathcal{N}(0, I)$.
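The forward noising and training objective can be sketched as below. The beta schedule, chunk length, and action dimension are illustrative, and the denoising network is replaced by a zero stand-in:

```python
import numpy as np

rng = np.random.default_rng(4)
K, horizon, act_dim = 100, 8, 7  # diffusion steps / chunk shape: illustrative

# Linear beta schedule and cumulative alpha-bar, as in standard DDPM.
betas = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - betas)

a0 = rng.normal(size=(horizon, act_dim))   # clean action chunk
k = rng.integers(K)                        # random diffusion step
eps = rng.normal(size=a0.shape)
a_k = np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1 - alpha_bar[k]) * eps

# Stand-in for eps_hat_theta(a_k, k, g_p, g_s); the real network is conditioned
# on the FiLM-modulated foresights.
eps_hat = np.zeros_like(eps)
loss = np.mean((eps - eps_hat) ** 2)       # noise-prediction objective
assert a_k.shape == a0.shape and loss >= 0.0
```

At inference the learned $\hat{\epsilon}_\theta$ is applied iteratively from pure noise to produce the action chunk, with the foresight conditioning held fixed for the chunk.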

Parameter Budget

| Component | Parameters |
| --- | --- |
| VLM (DecisionNCE, frozen) | 0.1B |
| CLaD dynamics model | 0.33B |
| Diffusion policy | 0.23B |
| Total | 0.66B |

Training

Hardware and Time

  • GPU: Single NVIDIA RTX 4090
  • Stage 1 (dynamics + foresight): 25,000 steps, batch size 128, 2 hours
  • Stage 2 (diffusion policy): 200,000 steps, batch size 128, 20 hours
  • Total training time: ~22 hours on a single GPU
  • Inference memory: 4 GB
  • Inference speed: 25 Hz (action chunk generation)
  • Planning latency: 0.012 s per step

Dataset

  • LIBERO-LONG [Liu et al., 2023]: 10 long-horizon manipulation tasks in kitchen/tabletop environments, each requiring 2–3 sequential subtasks. 50 demonstrations per task. Follows standard training protocol with top-3 checkpoint averaging over 20 rollouts.

Key Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Hidden dimension $H$ | 1024 |
| Learnable tokens $N_p, N_s$ | 4 each |
| Action horizon $\tau$ | 6 |
| Batch size | 128 |
| EMA momentum $m$ | 0.995 |
| $\lambda_{\text{recon}}$ | 0.1 |
| Action masking ratio $r$ | 0.3 |
| VLM | DecisionNCE (frozen) |

Special Tricks

  1. Stochastic action masking ($r = 0.3$): During training, action tokens are randomly replaced with a learnable token, similar to masked autoencoders [He et al., 2022]. This encourages the model to infer transitions from state differences alone, improving robustness. Heavier masking ($r = 0.9$) degrades performance to 88.2%, training action-free drops to 90.8%, and a curriculum (action-free → $r = 0.3$) performs worst at 85.1%, vs. the 94.7% baseline.

  2. EMA target encoders prevent the online encoder from chasing a moving target, following BYOL [Grill et al., 2020].

  3. Two-stage decoupling: Stage 1 focuses purely on accurate future state prediction without policy optimization bias; Stage 2 focuses on policy learning with frozen foresights.

  4. L2 normalization in latent loss: Constraining embeddings to a unit hypersphere prevents magnitude collapse while preserving angular relationships encoding semantic similarity.

  5. Learnable pooling (Perceiver-style [Jaegle et al., 2021]) rather than mean/max pooling, encouraging extraction of salient cross-modal dynamics patterns.

Evaluation

Main Benchmark: LIBERO-LONG (10 tasks, 20 rollouts × top-3 checkpoints)

| Method | Params (B) | Avg SR (%) | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuSIE | 0.86 | 76.3 | 83.3 | 63.3 | 96.6 | 100.0 | 83.3 | 83.3 | 83.3 | 39.9 | 53.3 | 76.6 |
| Seer | 0.32 | 87.7 | 88.3 | 90.0 | 98.3 | 100.0 | 91.7 | 93.3 | 85.0 | 91.7 | 61.7 | 71.7 |
| LBP | 0.19 | 88.6 | 90.0 | 100.0 | 100.0 | 76.6 | 86.6 | 100.0 | 90.0 | 86.6 | 60.0 | 96.6 |
| $\pi$0 | 3.3 | 82.0 | 74.0 | 94.0 | 88.0 | 22.0 | 100.0 | 88.0 | 98.0 | 86.0 | 76.0 | 94.0 |
| $\pi$0.5 | 3.3 | 93.2 | 92.0 | 98.0 | 98.0 | 98.0 | 100.0 | 94.0 | 96.0 | 100.0 | 62.0 | 94.0 |
| OpenVLA | 7.0 | 93.8 | 98.0 | 96.0 | 100.0 | 78.0 | 84.0 | 100.0 | 94.0 | 96.0 | 92.0 | 100.0 |
| CLaD (†) | 0.66 | 94.5 | 100.0 | 100.0 | 98.3 | 93.7 | 92.0 | 100.0 | 94.3 | 100.0 | 81.7 | 85.0 |
| CLaD (‡) | 0.66 | 94.7 | 100.0 | 100.0 | 98.0 | 94.0 | 91.0 | 100.0 | 95.0 | 100.0 | 82.0 | 87.0 |

† = top-3 checkpoint avg over 20 rollouts; ‡ = single checkpoint, 50 rollouts.

Where CLaD Wins

  • Overall SR: 94.7% beats OpenVLA (93.8%, 7B) by 0.9 pp, $\pi$0.5 (93.2%, 3.3B) by 1.5 pp, with 10× fewer parameters than OpenVLA.
  • Vs. similar-scale methods: +18.4 pp over SuSIE (0.86B), +7.0 pp over Seer (0.32B), +6.1 pp over LBP (0.19B).
  • Stability: Achieves 100% on 4/10 tasks (T1, T2, T6, T8), more than any baseline in the table; OpenVLA is next with three tasks at 100%.
  • Efficiency: 25 Hz inference vs. OpenVLA 6 Hz and $\pi$0.5 10 Hz. Only 4 GB memory vs. 15 GB / 19 GB.

Where CLaD Loses

  • Task 4 (bowl in drawer): 94.0% vs. SuSIE 100% and LBP 76.6%
  • Task 5 (mugs on plates): 91.0% vs. Seer 91.7%, $\pi$0.5 100%, LBP 86.6%
  • Task 6 (book placement): 100%, tied with LBP and OpenVLA
  • Task 9 (both pots on stove — perceptually similar objects): 82.0% — weakest task for CLaD, but still beats LBP (60.0%) and $\pi$0.5 (62.0%). This task is challenging due to perceptually similar objects requiring fine-grained discrimination.
  • Task 10 (mug in microwave): 87.0% vs. OpenVLA 100%
  • Generalization suites (from supplementary, 50 rollouts):
    • Spatial: 97.3% (vs. OpenVLA 98.2%, $\pi$0.5 96.5%)
    • Object: 95.7% (vs. OpenVLA 98.6%, $\pi$0.5 96.8%)
    • Goal: 94.3% (vs. OpenVLA 97.6%, $\pi$0.5 95.6%)
    • CLaD is optimized for long-horizon planning and underperforms large VLAs on short-horizon generalization tasks where massive pre-training knowledge dominates.

Computational Efficiency Comparison

| Method | Params (B) | Memory (GB) | Inference (Hz) | Planning Time (s) |
| --- | --- | --- | --- | --- |
| OpenVLA | 7.0 | 15 | 6 | n/a |
| $\pi$0.5 | 3.3 | 19 | 10 | n/a |
| UVA | 0.5 | n/a | n/a | 0.195 |
| LBP | 0.19 | n/a | n/a | 0.008 |
| CLaD | 0.66 | 4 | 25 | 0.012 |

Ablation Studies

Modality contribution (LIBERO-LONG avg SR):

| Variant | Avg SR (%) |
| --- | --- |
| Policy only (no foresight) | 84.8 |
| CLaD_p (proprioceptive only) | 50.4 |
| CLaD_s (semantic only) | 91.5 |
| CLaD (full cross-modal) | 94.7 |

Proprioceptive foresight alone harms performance (50.4%), confirming kinematic predictions without semantic context introduce misleading conditioning signals.

Reconstruction loss ablation:

| Objective | Avg SR (%) |
| --- | --- |
| $\mathcal{L}_{\text{latent}}$ only | 86.1 (–8.6) |
| $\mathcal{L}_{\text{latent}} + \mathcal{L}_{\text{recon}}$ | 94.7 |

Cross-attention configuration:

| Configuration | Avg SR (%) |
| --- | --- |
| Symmetric self-attention | 86.7 |
| Semantic queries proprioceptive | 93.8 |
| Proprioceptive queries semantic (ours) | 94.7 |

Action masking ablation:

| Variant | Avg SR (%) |
| --- | --- |
| CLaD (baseline, $r = 0.3$) | 94.7 |
| Heavy mask ($r = 0.9$) | 88.2 |
| Action-free | 90.8 |
| Curriculum (action-free → $r = 0.3$) | 85.1 |

Reproduction Guide

Prerequisites

# Hardware: single NVIDIA RTX 4090 (or equivalent with ~24 GB VRAM)
# Software: Python 3.10+, PyTorch 2.x, CUDA 12.x
conda create -n clad python=3.10 -y
conda activate clad
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers diffusers einops wandb

Step 1: Install CLaD

# As of paper publication, code is at the project page
git clone https://github.com/andrewwwj/clad.git
cd clad
pip install -e .

Step 2: Prepare LIBERO Dataset

# Install LIBERO benchmark
pip install libero

# Download LIBERO-LONG demonstrations
# Follow LIBERO repo instructions for dataset download
python -c "import libero; print(libero.get_libero_path('libero_long'))"

Step 3: Train Stage 1 (Cross-Modal Dynamics)

# Train dynamics model for 25K steps (~2 hours on RTX 4090)
# Key args: --hidden_dim 1024 --num_tokens 4 --action_horizon 6 \
#   --batch_size 128 --ema_momentum 0.995 --lambda_recon 0.1 \
#   --action_mask_ratio 0.3 --epochs 25000
python train_stage1.py \
  --dataset libero_long \
  --hidden_dim 1024 \
  --num_tokens 4 \
  --action_horizon 6 \
  --batch_size 128 \
  --ema_momentum 0.995 \
  --lambda_recon 0.1 \
  --action_mask_ratio 0.3 \
  --num_steps 25000

Verify Stage 1: Check that $\mathcal{L}_{\text{latent}}$ converges and $\mathcal{L}_{\text{recon}}$ remains stable. UMAP of $z_{\text{dyn}}$ should show task-specific clusters.

Step 4: Train Stage 2 (Diffusion Policy)

# Freeze CLaD dynamics, train diffusion policy for 200K steps (~20 hours)
python train_stage2.py \
  --dataset libero_long \
  --dynamics_ckpt checkpoints/stage1/best.pt \
  --batch_size 128 \
  --num_steps 200000

Verify Stage 2: Standard DDPM loss should decrease. Validate by running a single rollout.

Step 5: Evaluate

# Evaluate on LIBERO-LONG with top-3 checkpoint averaging, 20 rollouts
python evaluate.py \
  --benchmark libero_long \
  --ckpt checkpoints/stage2/best.pt \
  --num_rollouts 20 \
  --top_k_checkpoints 3

# Expected: ~94.5-94.7% average success rate

Verification Checklist

  • Stage 1 latent loss converges within ~10K steps
  • Stage 1 reconstruction loss stabilizes (not diverging)
  • UMAP visualization of $z_{\text{dyn}}$ shows distinct task clusters
  • Stage 2 DDPM loss decreases monotonically
  • Single-task rollout executes full multi-step sequence
  • Average success rate across 10 tasks is ≥ 90%

Notes

Key Takeaways

  1. Transitions over states: CLaD’s central insight is that cross-modal consistency should be enforced over transitions, not static observations. This is a fundamental shift from prior cross-modal robotics methods (R3M, DecisionNCE, RPT) that align at individual timesteps.

  2. Asymmetry matters: Proprioceptive transitions querying semantic transitions (94.7%) outperforms the reverse direction (93.8%) and symmetric attention (86.7%). This suggests robot kinematics provide a more grounded basis for interpreting scene changes — a useful inductive bias for manipulation.

  3. Grounding is essential, not optional: The reconstruction loss $\mathcal{L}_{\text{recon}}$ accounts for an 8.6 pp performance gap (86.1% → 94.7%). Without it, the latent space drifts toward excessive abstraction and task clusters collapse. This validates the “grounded foresight” design philosophy.

  4. Semantic foresight dominates, proprioceptive foresight can harm: Semantic-only foresight achieves 91.5% (good), proprioceptive-only drops to 50.4% (catastrophic). The joint model (94.7%) confirms that proprioception adds value only when grounded in semantic context.

  5. Efficiency through latent planning: CLaD matches or exceeds 7B-parameter VLAs on long-horizon tasks at 0.66B with 25 Hz inference and 4 GB memory, demonstrating that compact latent foresight can substitute for expensive semantic generation.

Connections to Other Work

  • System 1 / System 2 framing [Kahneman, 2011; Stanovich & West, 2000]: CLaD explicitly casts its two stages as System 2 (deliberative cross-modal dynamics reasoning) and System 1 (reactive diffusion policy), similar to hierarchical approaches in TD-MPC2, DreamerV3, and ACT.
  • BYOL family [Grill et al., 2020]: The EMA target encoder + MSE on normalized embeddings follows the BYOL self-supervised learning paradigm, adapted for transition prediction.
  • I-JEPA / V-JEPA [Assran et al., 2023; Bardes et al., 2024]: The joint-embedding predictive architecture where features predict future features parallels I-JEPA’s approach to visual prediction without generative decoding.
  • FiLM conditioning [Perez et al., 2018]: Used both for VLM feature fusion (Stage 1) and for observation-modulated foresight conditioning (Stage 2), providing a lightweight mechanism for cross-modal grounding.
  • Latent planning [TD-MPC2, LBP]: CLaD operates in the latent planning paradigm but uniquely introduces explicit cross-modal transition modeling, distinguishing it from methods that implicitly conflate modalities.
  • Diffusion Policy [Chi et al., 2023]: The Stage 2 action generation uses standard DDPM conditioning extended with foresight-modulated observations.
  • Action masking [He et al., 2022]: The stochastic action token replacement follows MAE-style masking to encourage state-difference-based transition inference.

Limitations (from authors)

  1. Compact latent representations may not capture fine visual details for tasks with perceptual ambiguity (Task 9: 82.0%).
  2. Two-stage training requires ~22 hours; authors suggest amortizing Stage 1 pre-training on large-scale heterogeneous robot datasets.
  3. Focused on manipulation; cross-modal dynamics principle could generalize to mobile manipulation with force/tactile feedback.