2026-03-29
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik
problem
robot world models are trained as single-step predictors: given frame $x_t$ and action $a_t$, predict $x_{t+1}$. this works well for one-step accuracy, but breaks down catastrophically when deployed autoregressively for multi-step rollouts. at each step, prediction error $\epsilon_t$ is fed back as input for the next step, and these errors compound. after $T$ steps the visual quality of generated frames degrades to the point of being unusable for planning.
this is the core bottleneck for using video world models in robotics. long-horizon tasks (manipulation sequences, navigation) require stable rollouts over $50$–$100+$ steps, but even state-of-the-art diffusion world models diverge after a handful of steps.
the fundamental issue is a train-test mismatch: the model is trained on ground-truth history $\{x_0, \ldots, x_t\}$ but at inference it receives its own predictions $\{\hat{x}_0, \ldots, \hat{x}_t\}$. small discrepancies accumulate, and the model has never learned to operate on its own noisy outputs.
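a toy numerical sketch (not from the paper) makes the compounding concrete: if each autoregressive step amplifies the inherited error by a factor $g > 1$ and adds a fresh per-step error $\epsilon$, the accumulated error grows roughly geometrically with horizon length. the values of `g` and `eps` below are illustrative assumptions.

```python
# Toy illustration of compounding rollout error (assumption: each
# autoregressive step amplifies inherited error by a factor g and
# adds a fresh per-step error eps).
def rollout_error(T, g=1.1, eps=0.01):
    """Accumulated prediction error after T autoregressive steps."""
    err = 0.0
    for _ in range(T):
        err = g * err + eps  # previous error is fed back and amplified
    return err

print(rollout_error(1))   # single-step error: just eps
print(rollout_error(50))  # 50-step rollout: orders of magnitude larger
```

with these illustrative constants, 50 steps already produce an error hundreds of times the single-step error, which is the qualitative behavior the paper targets.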
architecture
```mermaid
flowchart TD
    gt["ground truth (x_t, a_t, x_t+1)"] --> Base[base diffusion world model]
    Base --> Roll[autoregressive rollout]
    Roll --> state["rollout state s_t (history)"]
    state --> Gen[generate K candidate frames]
    Gen --> c1[candidate 1]
    Gen --> c2[candidate 2]
    Gen --> ck[candidate K]
    c1 --> RL[contrastive RL scoring]
    c2 --> RL
    ck --> RL
    RL --> best[highest-fidelity winner]
    best --> Roll
    style Base fill:#c4b8a6,color:#fff
    style RL fill:#b09a84,color:#fff
```
the base world model is a diffusion-based video predictor. given observation $x_t$ and action $a_t$, it generates the next frame $\hat{x}_{t+1}$ via an iterative denoising process. the architecture itself is standard (UNet backbone, temporal conditioning on actions).
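as a schematic of what "generate via iterative denoising" means here (not the paper's implementation), the sketch below uses a hypothetical `eps_model` noise predictor standing in for the UNet backbone, and a crude Euler-style update standing in for the full DDPM posterior step:

```python
import numpy as np

def sample_next_frame(eps_model, x_t, a_t, n_steps=50, rng=None):
    """Schematic diffusion sampler: start from Gaussian noise and
    iteratively denoise, conditioning on the current frame x_t and
    action a_t. eps_model(noisy, x_t, a_t, step) -> predicted noise
    is a hypothetical stand-in for the UNet backbone."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(x_t.shape)          # pure-noise initialization
    for step in reversed(range(n_steps)):
        eps_hat = eps_model(x, x_t, a_t, step)  # predicted noise at this step
        x = x - eps_hat / n_steps               # crude denoising update
    return x
```

the point of the sketch is the interface: every call conditions on $(x_t, a_t)$, which is exactly the conditioning slot that RL post-training later fills with the model's own predictions instead of ground truth.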
the key contribution is not architectural but training-based: a reinforcement learning post-training scheme that closes the train-test gap by training the model on its own autoregressive rollouts rather than ground-truth histories.
contrastive rl objective for diffusion
the RL objective is adapted from contrastive methods originally designed for discrete and continuous RL. for diffusion models, this is non-trivial because the output space is high-dimensional (full images) and the action space is the denoising trajectory.
given a state $s_t$ (the current rollout history), the model generates $K$ candidate next frames $\{\hat{x}_{t+1}^{(1)}, \ldots, \hat{x}_{t+1}^{(K)}\}$ by sampling different noise realizations. each candidate is scored by a reward function $\mathcal{R}$, and the RL objective reinforces candidates with higher rewards relative to the batch:
\[\mathcal{L}_{\text{RL}} = -\mathbb{E}_{s_t} \left[ \log \frac{\exp(\beta \, \mathcal{R}(\hat{x}_{t+1}^{(i^{*})}, x_{t+1}))}{\sum_{j=1}^{K} \exp(\beta \, \mathcal{R}(\hat{x}_{t+1}^{(j)}, x_{t+1}))} \right]\]where $i^{*} = \arg\max_j \mathcal{R}(\hat{x}_{t+1}^{(j)}, x_{t+1})$ indexes the highest-reward candidate and $\beta$ is a temperature parameter controlling the sharpness of the preference distribution. this is a softmax contrastive loss: it pushes probability mass toward the best candidate within the batch.
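a minimal numpy sketch of this softmax contrastive loss, computed on per-candidate rewards (illustrative only; real training would backpropagate through the model's denoising log-probabilities in a framework like PyTorch):

```python
import numpy as np

def contrastive_rl_loss(rewards, beta=1.0):
    """Softmax contrastive loss over K candidate rewards.
    rewards: length-K sequence, one reward per candidate frame.
    Returns the negative log softmax probability of the
    highest-reward candidate under temperature beta."""
    scaled = beta * np.asarray(rewards, dtype=float)
    scaled -= scaled.max()                      # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum())
    best = int(np.argmax(rewards))              # highest-reward candidate
    return -log_probs[best]
```

larger $\beta$ sharpens the preference distribution and shrinks the loss when one candidate clearly dominates; as $\beta \to 0$ all candidates are weighted equally and the loss approaches $\log K$.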
reward design
the reward $\mathcal{R}$ combines multi-view perceptual fidelity metrics. for a multi-camera robot setup with external and wrist cameras:
\[\mathcal{R}(\hat{x}_{t+1}, x_{t+1}) = -\lambda_{\text{lpips}} \cdot \text{LPIPS}(\hat{x}_{t+1}, x_{t+1}) + \lambda_{\text{ssim}} \cdot \text{SSIM}(\hat{x}_{t+1}, x_{t+1})\]where $\text{LPIPS}$ measures perceptual similarity (lower is better) and $\text{SSIM}$ measures structural similarity (higher is better), computed across camera views. this provides dense, low-variance supervision compared to sparse task rewards.
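a sketch of the reward combination, averaged over camera views. the metric functions are injected so a real setup could pass the `lpips` package and a torchmetrics SSIM; the weights and the dict-of-views interface are illustrative assumptions, not the paper's values:

```python
import numpy as np

def rollout_reward(pred, gt, lpips_fn, ssim_fn,
                   lam_lpips=1.0, lam_ssim=1.0):
    """Combined perceptual reward, averaged over camera views.
    pred, gt: dicts mapping view name (e.g. 'external', 'wrist')
    to image arrays. lpips_fn is penalized (lower is better),
    ssim_fn is rewarded (higher is better)."""
    per_view = []
    for view in gt:
        r = (-lam_lpips * lpips_fn(pred[view], gt[view])
             + lam_ssim * ssim_fn(pred[view], gt[view]))
        per_view.append(r)
    return float(np.mean(per_view))
```

with pluggable metrics, a perfect prediction scores the SSIM ceiling and any degradation lowers the reward, giving the dense low-variance signal described above.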
variable-length rollout training
the training protocol samples variable-length rollout horizons $T \sim \mathcal{U}(1, T_{\max})$. for each rollout, the model generates frames autoregressively and receives RL signal at each step. this curriculum means the model sees and learns to correct for errors at every timescale simultaneously.
training
base model training: standard diffusion denoising objective on single-step ground-truth pairs $(x_t, a_t, x_{t+1})$.
RL post-training phase:
- sample a rollout length $T$ and an initial state from the dataset
- generate $K$ candidate futures from each autoregressive step
- compute rewards against the ground-truth continuation
- apply the contrastive RL update
- the model’s own predictions serve as conditioning for subsequent steps
this is applied as post-training on an already-converged base diffusion model. the RL phase requires no architectural changes to the base model – it only adjusts the existing parameters via the contrastive gradient signal.
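the post-training loop above can be sketched as follows. the `model` and `dataset` interfaces (`generate`, `contrastive_update`, `sample_trajectory`) are hypothetical stand-ins, not a real API; the structure is what matters: conditioning always flows through the model's own best prediction, reproducing the deployment-time feedback loop.

```python
import numpy as np

def rl_post_train_step(model, dataset, reward_fn, K=4, T_max=16, rng=None):
    """One schematic RL post-training update:
    - sample a rollout horizon T ~ U(1, T_max) and a ground-truth clip
    - at each step, generate K candidates conditioned on the model's
      OWN previous prediction, score them against the ground-truth
      continuation, and apply the contrastive update
    - feed the highest-reward candidate back as the next conditioning
      frame."""
    if rng is None:
        rng = np.random.default_rng()
    T = int(rng.integers(1, T_max + 1))              # variable-length horizon
    frames, actions = dataset.sample_trajectory(T)   # gt frames[0..T], actions[0..T-1]
    history = [frames[0]]
    for t in range(T):
        candidates = [model.generate(history, actions[t]) for _ in range(K)]
        rewards = [reward_fn(c, frames[t + 1]) for c in candidates]
        model.contrastive_update(candidates, rewards)        # softmax RL loss
        history.append(candidates[int(np.argmax(rewards))])  # feed back winner
    return T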
evaluation
evaluated on the DROID dataset (diverse real-robot manipulation trajectories across multiple scenes and objects).
quantitative metrics
| metric | camera view | change vs. baseline |
|---|---|---|
| LPIPS (lower is better) | external | −14% |
| SSIM (higher is better) | wrist | +9.1% |
these numbers represent the improvement of the RL post-trained model over the base diffusion world model baseline on autoregressive rollouts.
paired comparison
- 98% win rate in automated paired comparisons against the baseline across rollout steps
- 80% human preference in blind human evaluation study where annotators compared rollout videos from the base model vs RL post-trained model
qualitative behavior
the RL post-trained model maintains visual coherence significantly longer during autoregressive rollouts. the base diffusion model shows characteristic failure modes: color drift, object blur, and eventual structural collapse. the RL post-trained model suppresses these failure modes by learning to generate predictions that remain stable under its own feedback loop.
reproduction guide
- no public code repo as of publication. this is a significant barrier to reproduction.
- dataset: DROID is publicly available (droid-dataset.github.io) – download and preprocess multi-view manipulation trajectories.
- base model: train or obtain a standard diffusion world model on single-step prediction. any reasonable UNet-based video diffusion architecture should work as the base.
- RL post-training: implement the contrastive RL objective:
- sample $K$ candidates per step ($K=4$–$8$ is typical)
- compute LPIPS + SSIM rewards against ground-truth frames
- apply the softmax contrastive loss with temperature $\beta$
- variable-length rollouts: start with short horizons ($T=4$) and progressively increase to $T=16$–$32$ during RL post-training.
- key ablation: compare fixed ground-truth conditioning (standard training) vs autoregressive conditioning (RL post-training) to isolate the compounding error effect.
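one simple way to realize the progressive horizon schedule from the rollout bullet above (the $T=4$ start and $T=16$–$32$ end come from that suggestion; the linear ramp itself is an assumption):

```python
def horizon_schedule(step, total_steps, T_start=4, T_end=32):
    """Linear horizon curriculum for RL post-training: ramp the
    sampled rollout horizon cap from T_start to T_end over training,
    clamping outside [0, total_steps]."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return int(round(T_start + frac * (T_end - T_start)))
```

pass the returned value as `T_max` when sampling each rollout, so early training sees short, easy horizons and late training sees the long horizons where compounding error dominates.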
notes
the central insight is reframing world model stability as an RL problem. rather than designing complex architectures to prevent error accumulation (recurrent state, latent feedback, etc.), they simply train the model on the distribution it will actually encounter at deployment time: its own generated outputs.
the multi-candidate rollout comparison is clever and practical. by generating $K$ futures from the same state, you get a ranking within each batch without needing a separate learned reward model. the ground-truth frame provides the “answer key” for ranking. this is much cheaper than training a reward model and provides direct gradient signal.
this approach is complementary to architectural solutions. LeWorldModel (2603.19312) attacks stability from the architecture side (Gaussian prior, latent space design), while this paper attacks it from the training side. the open question is whether combining both yields multiplicative gains.
relevance to BOPI: this directly addresses the rollout stability bottleneck that makes video world models impractical for long-horizon planning. if RL post-training can stabilize arbitrary base diffusion world models, it provides a general-purpose tool rather than requiring bespoke architectures. the contrastive RL objective is also architecturally agnostic – it could potentially be applied to non-diffusion world models as well.
open questions:
- how much compute does the RL post-training phase require relative to the base model training?
- does this transfer across robot embodiments and tasks, or does the RL phase need to be rerun per domain?
- what is the maximum stable rollout horizon achievable, and does it scale with more RL training?
- can the reward be extended beyond perceptual metrics to include task-relevant signals (e.g., object position accuracy)?