2026-03-29
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik
problem
robot world models are trained as single-step predictors: given frame $x_t$ and action $a_t$, predict $x_{t+1}$. this works well for one-step accuracy, but breaks down catastrophically when deployed autoregressively for multi-step rollouts. at each step, prediction error $\epsilon_t$ is fed back as input for the next step, and these errors compound. after $T$ steps the visual quality of generated frames degrades to the point of being unusable for planning.
this is the core bottleneck for using video world models in robotics. long-horizon tasks (manipulation sequences, navigation) require stable rollouts over $50$–$100+$ steps, but even state-of-the-art diffusion world models diverge after a handful of steps.
the fundamental issue is a train-test mismatch: the model is trained on ground-truth history $\{x_0, \ldots, x_t\}$ but at inference it receives its own predictions $\{\hat{x}_0, \ldots, \hat{x}_t\}$. small discrepancies accumulate, and the model has never learned to operate on its own noisy outputs.
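a toy numerical sketch (not from the paper) makes the compounding concrete: if each autoregressive step amplifies the inherited error by a factor $g > 1$ and adds a fresh per-step error $\epsilon$, the accumulated error grows roughly geometrically with horizon length. the values of `g` and `eps` below are illustrative assumptions.

```python
# Toy illustration of compounding rollout error (assumption: each
# autoregressive step amplifies inherited error by a factor g and
# adds a fresh per-step error eps).
def rollout_error(T, g=1.1, eps=0.01):
    """Accumulated prediction error after T autoregressive steps."""
    err = 0.0
    for _ in range(T):
        err = g * err + eps  # previous error is fed back and amplified
    return err

print(rollout_error(1))   # single-step error: just eps
print(rollout_error(50))  # 50-step rollout: orders of magnitude larger
```

with these illustrative constants, 50 steps already produce an error hundreds of times the single-step error, which is the qualitative behavior the paper targets.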
architecture
```mermaid
flowchart TD
    gt["ground truth (x_t, a_t, x_t+1)"] --> Base[base diffusion world model]
    Base --> Roll[autoregressive rollout]
    Roll --> state["rollout state s_t (history)"]
    state --> Gen[generate K candidate frames]
    Gen --> c1[candidate 1]
    Gen --> c2[candidate 2]
    Gen --> ck[candidate K]
    c1 --> RL[contrastive RL scoring]
    c2 --> RL
    ck --> RL
    RL --> best[highest-fidelity winner]
    best --> Roll
    style Base fill:#c4b8a6,color:#fff
    style RL fill:#b09a84,color:#fff
```
the base world model is a diffusion-based video predictor. given observation $x_t$ and action $a_t$, it generates the next frame $\hat{x}_{t+1}$ via an iterative denoising process. the architecture itself is standard (UNet backbone, temporal conditioning on actions).
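as a schematic of what "generate via iterative denoising" means here (not the paper's implementation), the sketch below uses a hypothetical `eps_model` noise predictor standing in for the UNet backbone, and a crude Euler-style update standing in for the full DDPM posterior step:

```python
import numpy as np

def sample_next_frame(eps_model, x_t, a_t, n_steps=50, rng=None):
    """Schematic diffusion sampler: start from Gaussian noise and
    iteratively denoise, conditioning on the current frame x_t and
    action a_t. eps_model(noisy, x_t, a_t, step) -> predicted noise
    is a hypothetical stand-in for the UNet backbone."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(x_t.shape)          # pure-noise initialization
    for step in reversed(range(n_steps)):
        eps_hat = eps_model(x, x_t, a_t, step)  # predicted noise at this step
        x = x - eps_hat / n_steps               # crude denoising update
    return x
```

the point of the sketch is the interface: every call conditions on $(x_t, a_t)$, which is exactly the conditioning slot that RL post-training later fills with the model's own predictions instead of ground truth.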
the key contribution is not architectural but training-based: a reinforcement learning post-training scheme that closes the train-test gap by training the model on its own autoregressive rollouts rather than ground-truth histories.
contrastive rl objective for diffusion
the RL objective is adapted from contrastive methods originally designed for discrete and continuous RL. for diffusion models, this is non-trivial because the output space is high-dimensional (full images) and the action space is the denoising trajectory.
given a state $s_t$ (the current rollout history), the model generates $K$ candidate next frames $\{\hat{x}_{t+1}^{(1)}, \ldots, \hat{x}_{t+1}^{(K)}\}$ by sampling different noise realizations. each candidate is scored by a reward function $\mathcal{R}$, and the RL objective reinforces candidates with higher rewards relative to the batch:
\[\mathcal{L}_{\text{RL}} = -\mathbb{E}_{s_t} \left[ \log \frac{\exp(\beta \, \mathcal{R}(\hat{x}_{t+1}^{(i^{*})}, x_{t+1}))}{\sum_{j=1}^{K} \exp(\beta \, \mathcal{R}(\hat{x}_{t+1}^{(j)}, x_{t+1}))} \right]\]where $i^{*} = \arg\max_j \mathcal{R}(\hat{x}_{t+1}^{(j)}, x_{t+1})$ indexes the highest-reward candidate and $\beta$ is a temperature parameter controlling the sharpness of the preference distribution. this is a softmax contrastive loss: it pushes probability mass toward the best candidate within the batch.
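a minimal numpy sketch of this softmax contrastive loss, computed on per-candidate rewards (illustrative only; real training would backpropagate through the model's denoising log-probabilities in a framework like PyTorch):

```python
import numpy as np

def contrastive_rl_loss(rewards, beta=1.0):
    """Softmax contrastive loss over K candidate rewards.
    rewards: length-K sequence, one reward per candidate frame.
    Returns the negative log softmax probability of the
    highest-reward candidate under temperature beta."""
    scaled = beta * np.asarray(rewards, dtype=float)
    scaled -= scaled.max()                      # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum())
    best = int(np.argmax(rewards))              # highest-reward candidate
    return -log_probs[best]
```

larger $\beta$ sharpens the preference distribution and shrinks the loss when one candidate clearly dominates; as $\beta \to 0$ all candidates are weighted equally and the loss approaches $\log K$.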
reward design
the reward $\mathcal{R}$ combines multi-view perceptual fidelity metrics. for a multi-camera robot setup with external and wrist cameras:
\[\mathcal{R}(\hat{x}_{t+1}, x_{t+1}) = -\lambda_{\text{lpips}} \cdot \text{LPIPS}(\hat{x}_{t+1}, x_{t+1}) + \lambda_{\text{ssim}} \cdot \text{SSIM}(\hat{x}_{t+1}, x_{t+1})\]where $\text{LPIPS}$ measures perceptual similarity (lower is better) and $\text{SSIM}$ measures structural similarity (higher is better), computed across camera views. this provides dense, low-variance supervision compared to sparse task rewards.
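a sketch of the reward combination, averaged over camera views. the metric functions are injected so a real setup could pass the `lpips` package and a torchmetrics SSIM; the weights and the dict-of-views interface are illustrative assumptions, not the paper's values:

```python
import numpy as np

def rollout_reward(pred, gt, lpips_fn, ssim_fn,
                   lam_lpips=1.0, lam_ssim=1.0):
    """Combined perceptual reward, averaged over camera views.
    pred, gt: dicts mapping view name (e.g. 'external', 'wrist')
    to image arrays. lpips_fn is penalized (lower is better),
    ssim_fn is rewarded (higher is better)."""
    per_view = []
    for view in gt:
        r = (-lam_lpips * lpips_fn(pred[view], gt[view])
             + lam_ssim * ssim_fn(pred[view], gt[view]))
        per_view.append(r)
    return float(np.mean(per_view))
```

with pluggable metrics, a perfect prediction scores the SSIM ceiling and any degradation lowers the reward, giving the dense low-variance signal described above.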
variable-length rollout training
the training protocol samples variable-length rollout horizons $T \sim \mathcal{U}(1, T_{\max})$. for each rollout, the model generates frames autoregressively and receives RL signal at each step. this curriculum means the model sees and learns to correct for errors at every timescale simultaneously.
training
base model training: standard diffusion denoising objective on single-step ground-truth pairs $(x_t, a_t, x_{t+1})$.
RL post-training phase:
- sample a rollout length $T$ and an initial state from the dataset
- generate $K$ candidate futures from each autoregressive step
- compute rewards against the ground-truth continuation
- apply the contrastive RL update
- the model’s own predictions serve as conditioning for subsequent steps
this is applied as post-training on an already-converged base diffusion model. the RL phase requires no architectural changes to the base model – it only adjusts the existing parameters via the contrastive gradient signal.
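the post-training loop above can be sketched as follows. the `model` and `dataset` interfaces (`generate`, `contrastive_update`, `sample_trajectory`) are hypothetical stand-ins, not a real API; the structure is what matters: conditioning always flows through the model's own best prediction, reproducing the deployment-time feedback loop.

```python
import numpy as np

def rl_post_train_step(model, dataset, reward_fn, K=4, T_max=16, rng=None):
    """One schematic RL post-training update:
    - sample a rollout horizon T ~ U(1, T_max) and a ground-truth clip
    - at each step, generate K candidates conditioned on the model's
      OWN previous prediction, score them against the ground-truth
      continuation, and apply the contrastive update
    - feed the highest-reward candidate back as the next conditioning
      frame."""
    if rng is None:
        rng = np.random.default_rng()
    T = int(rng.integers(1, T_max + 1))              # variable-length horizon
    frames, actions = dataset.sample_trajectory(T)   # gt frames[0..T], actions[0..T-1]
    history = [frames[0]]
    for t in range(T):
        candidates = [model.generate(history, actions[t]) for _ in range(K)]
        rewards = [reward_fn(c, frames[t + 1]) for c in candidates]
        model.contrastive_update(candidates, rewards)        # softmax RL loss
        history.append(candidates[int(np.argmax(rewards))])  # feed back winner
    return T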
evaluation
evaluated on the DROID dataset (diverse real-robot manipulation trajectories across multiple scenes and objects).
quantitative metrics
| metric | camera view | change vs. baseline |
|---|---|---|
| LPIPS (lower is better) | external | −14% |
| SSIM (higher is better) | wrist | +9.1% |
these numbers represent the improvement of the RL post-trained model over the base diffusion world model baseline on autoregressive rollouts.
paired comparison
- 98% win rate in automated paired comparisons against the baseline across rollout steps
- 80% human preference in blind human evaluation study where annotators compared rollout videos from the base model vs RL post-trained model
qualitative behavior
the RL post-trained model maintains visual coherence significantly longer during autoregressive rollouts. the base diffusion model shows characteristic failure modes: color drift, object blur, and eventual structural collapse. the RL post-trained model suppresses these failure modes by learning to generate predictions that remain stable under its own feedback loop.
reproduction guide
- no public code repo as of publication. this is a significant barrier to reproduction.
- dataset: DROID is publicly available (droid-dataset.github.io) – download and preprocess multi-view manipulation trajectories.
- base model: train or obtain a standard diffusion world model on single-step prediction. any reasonable UNet-based video diffusion architecture should work as the base.
- RL post-training: implement the contrastive RL objective:
- sample $K$ candidates per step ($K=4$–$8$ is typical)
- compute LPIPS + SSIM rewards against ground-truth frames
- apply the softmax contrastive loss with temperature $\beta$
- variable-length rollouts: start with short horizons ($T=4$) and progressively increase to $T=16$–$32$ during RL post-training.
- key ablation: compare fixed ground-truth conditioning (standard training) vs autoregressive conditioning (RL post-training) to isolate the compounding error effect.
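one simple way to realize the progressive horizon schedule from the rollout bullet above (the $T=4$ start and $T=16$–$32$ end come from that suggestion; the linear ramp itself is an assumption):

```python
def horizon_schedule(step, total_steps, T_start=4, T_end=32):
    """Linear horizon curriculum for RL post-training: ramp the
    sampled rollout horizon cap from T_start to T_end over training,
    clamping outside [0, total_steps]."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return int(round(T_start + frac * (T_end - T_start)))
```

pass the returned value as `T_max` when sampling each rollout, so early training sees short, easy horizons and late training sees the long horizons where compounding error dominates.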
notes
the central insight is reframing world model stability as an RL problem. rather than designing complex architectures to prevent error accumulation (recurrent state, latent feedback, etc.), they simply train the model on the distribution it will actually encounter at deployment time: its own generated outputs.
the multi-candidate rollout comparison is clever and practical. by generating $K$ futures from the same state, you get a ranking within each batch without needing a separate learned reward model. the ground-truth frame provides the “answer key” for ranking. this is much cheaper than training a reward model and provides direct gradient signal.
this approach is complementary to architectural solutions. LeWorldModel (2603.19312) attacks stability from the architecture side (Gaussian prior, latent space design), while this paper attacks it from the training side. the open question is whether combining both yields multiplicative gains.
relevance to BOPI: this directly addresses the rollout stability bottleneck that makes video world models impractical for long-horizon planning. if RL post-training can stabilize arbitrary base diffusion world models, it provides a general-purpose tool rather than requiring bespoke architectures. the contrastive RL objective is also architecturally agnostic – it could potentially be applied to non-diffusion world models as well.
open questions:
- how much compute does the RL post-training phase require relative to the base model training?
- does this transfer across robot embodiments and tasks, or does the RL phase need to be rerun per domain?
- what is the maximum stable rollout horizon achievable, and does it scale with more RL training?
- can the reward be extended beyond perceptual metrics to include task-relevant signals (e.g., object position accuracy)?