2026-04-01

Do World Action Models Generalize Better than VLAs? A Robustness Study

Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang et al.

world-models VLA robotics robustness

problem

robot policies must handle real-world diversity and uncertainty. Vision-Language-Action (VLA) models map current state directly to action: $p_\theta(a_t \mid h_t)$. World Action Models (WAMs) predict both future state and action: either jointly $p_\phi(h_{t+1}, a_t \mid h_t)$ or in an inverse dynamics style $p_\phi(h_{t+1} \mid h_t) \cdot g_\psi(a_t \mid h_t, h_{t+1})$. WAMs claim better generalization via spatiotemporal priors from web-scale video pre-training, but this had never been systematically tested under controlled perturbations.

VLAs like $\pi_0.5$ and OpenVLA are fine-tuned from VLM backbones pre-trained on static image-text data (next-token prediction), which gives them strong language understanding but limited fine-grained dynamic prediction capability. WAMs like Cosmos-Policy and GE-Act use video diffusion backbones pre-trained on web-scale video for future state prediction $p_\phi(h_{t+1} \mid h_t)$, giving them spatiotemporal priors that might help with visual perturbations. but WAMs are slower (they must iteratively denoise the visual state at inference time) and had not been compared apples-to-apples before this study.
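the contrast between the two factorizations can be sketched with toy stand-ins. `vla_policy` and `wam_policy` are hypothetical names of mine, and random projections stand in for the learned networks; this only illustrates the interfaces, not any model in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def vla_policy(h_t):
    """VLA: map the current state directly to an action, p(a_t | h_t).
    A random projection stands in for the learned network."""
    W = rng.standard_normal((7, h_t.size))  # 7-dim action head
    return W @ h_t

def wam_policy(h_t):
    """WAM, inverse-dynamics factorization: first predict the future
    state p(h_{t+1} | h_t), then infer the action g(a_t | h_t, h_{t+1})."""
    h_next = h_t + 0.1 * rng.standard_normal(h_t.shape)  # world-model rollout (stub)
    W = rng.standard_normal((7, 2 * h_t.size))
    return W @ np.concatenate([h_t, h_next])

h = rng.standard_normal(64)  # toy latent observation
assert vla_policy(h).shape == wam_policy(h).shape == (7,)
```

the structural point is that the WAM pays for an extra forward rollout of the world model before it can emit an action, which is where the latency gap discussed later comes from.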

hybrid models (VLA-JEPA, MOTUS) try to combine both: VLA-JEPA uses a Qwen3-VL-2B backbone with JEPA-style future state prediction, and MOTUS uses a Mixture-of-Transformers with Wan2.2-5B for video and a separate VLM for action.

this paper introduces two robustness benchmarks (LIBERO-Plus and RoboTwin 2.0-Plus) with 7 perturbation dimensions each, and systematically compares VLAs, WAMs, and hybrid models.

architecture

this is a comparative study, not a new method. the paper evaluates existing models:

WAMs evaluated:

  • Cosmos-Policy (2B params): unified transformer for video and action from Cosmos-Predict2-2B. joint denoising of state (5 steps) and action (5 steps). pretrain-free. action as latent frame + native 7-dim, chunk size 16.
  • GE-Act (2.2B params): Mixture-of-Transformers (MOT) with LTX-Video-2B for video and separate action stream with cross-modal attention. 1 state denoising step, 10 action steps. auto-regressive generation.
  • LingBot-VA (5.3B params): Wan2.2-5B backbone, interleaved future visual state + action prediction. causal IDM-style: action conditioned on predicted visual state. real-world: 3 state + 5 action denoising steps. 30-dim action (absolute EEF quaternion + joints), chunk size 4.
  • DreamZero (14B params): Wan2.1-14B backbone. excluded from benchmarks due to proprietary data and 15+ minute warm-up time.
  • GigaWorld-Policy (>5B): Wan2.2-5B backbone. predicts action first, then future visual state. omits state generation at test time. latency 360ms. no public checkpoints.
  • Fast-WAM (6B): Wan2.2-5B backbone. MOT architecture. pretrain-free. latency 190ms. no public checkpoints.

VLAs evaluated:

  • $\pi_0.5$: VLM backbone, absolute EEF/joint control, 50-step chunks, 50Hz, flow matching
  • $\pi_0$: VLM backbone, delta EEF, 50-step chunks, 50Hz
  • $\pi_0$-FAST: FAST tokenized delta EEF, 50-step chunks, 15-50Hz
  • OpenVLA-OFT: VLM backbone, delta EEF, 8-step chunks, 3-10Hz
  • X-VLA: absolute EEF with 6D rotation, 10/20-dim, 32-step chunks
  • VLA-JEPA: Qwen3-VL-2B + JEPA future-state prediction, delta EEF, 7-step horizon, conditional flow-matching
  • MOTUS: MOT with Wan2.2-5B for video + separate VLM for action, 14-dim latent action (optical flow), 48-step chunks, 30Hz

training

this is a comparative study. training details are from the evaluated models:

$\pi_0.5$ finetuning on RoboTwin: pretrained $\pi_0.5$ checkpoint, full 27.5K RoboTwin 2.0 training trajectories, 60K gradient steps, AdamW ($\beta_1=0.9$, $\beta_2=0.95$), gradient clipping 1.0, cosine LR decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$, batch size 64, delta joint actions, openpi (JAX).
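the cosine schedule above is easy to restate concretely. a minimal sketch of the decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$ over 60K steps, assuming no warmup (the actual openpi config may differ):

```python
import math

def cosine_lr(step, total_steps=60_000, lr_max=2.5e-5, lr_min=2.5e-6):
    """Cosine decay from lr_max to lr_min over total_steps."""
    t = min(step, total_steps) / total_steps  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

assert abs(cosine_lr(0) - 2.5e-5) < 1e-12        # starts at lr_max
assert abs(cosine_lr(60_000) - 2.5e-6) < 1e-12   # ends at lr_min
```

halfway through (step 30K) the rate sits at the midpoint of the two bounds, $1.375 \times 10^{-5}$, as the cosine term crosses zero.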

training data summary by model:

| model | web video | robot data | task-specific | pretrain-free |
|---|---|---|---|---|
| $\pi_0$ | — | >10k h cross-embodiment | 5-100 h | no |
| $\pi_0.5$ | web VQA/captioning | 400h mobile manip | 1-20 h | no |
| OpenVLA-OFT | — | 970k cross-embodiment | 20-300 demos | no |
| X-VLA | — | 288k cross-embodiment | 50 demos | no |
| Cosmos-Policy | — | — | 185 demos | yes |
| GE-Act | — | 3k h single-embodiment | 1 h | no |
| LingBot-VA | — | 16k h cross-embodiment | 50 demos | no |
| VLA-JEPA | 220k human ego | 76k single-embodiment | 100 demos | no |
| MOTUS | 231k human ego | 781k cross-embodiment | 100 demos + 1k task-agnostic | no |

evaluation

LIBERO-Plus (MuJoCo/robosuite, Franka Panda 7-DoF, 256 $\times$ 256, 7-dim delta EEF at 10Hz): 40 base tasks, 416 distractor objects, 7 perturbation dimensions. $\pi_0.5$ finetuned with 60K gradient steps.

LIBERO-Plus results (success rate %):

| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0$ | 94.2 | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| $\pi_0$ (rerun) | 91.3 | 61.0 | 40.8 | 63.5 | 89.3 | 84.1 | 80.1 | 76.4 | 69.4 |
| $\pi_0.5$ | 96.9 | 75.4 | 77.5 | 85.6 | 96.9 | 94.6 | 89.7 | 85.7 | 85.7 |
| OpenVLA-OFT | 97.6 | 55.6 | 21.7 | 81.0 | 92.7 | 91.0 | 78.6 | 68.7 | 67.9 |
| X-VLA | 98.1 | 23.4 | 89.7 | 75.7 | 88.2 | 96.0 | 62.7 | 71.8 | 71.4 |
| RIPT-VLA | 97.5 | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 68.4 |
| ABot-M0 | 98.6 | 60.4 | 67.9 | 86.4 | 96.2 | 91.6 | 86.4 | 82.6 | 80.5 |
| VLA-JEPA | 97.2 | 64.2 | 67.7 | 88.1 | 91.8 | 93.4 | 65.8 | 83.9 | 77.9 |
| GE-Act | 94.4 | 60.7 | 77.0 | 77.4 | 95.8 | 86.0 | 90.9 | 80.2 | 80.3 |
| Cosmos-Policy | 98.5 | 75.8 | 63.3 | 81.7 | 96.5 | 88.9 | 92.7 | 82.2 | 82.2 |

overall ranking: $\pi_0.5$ (85.7) > Cosmos-Policy (82.2) > ABot-M0 (80.5) > GE-Act (80.3) > VLA-JEPA (77.9) > X-VLA (71.4).

RoboTwin 2.0-Plus (SAPIEN/ManiSkill3, Aloha-AgileX 14-DoF bimanual, 320 $\times$ 240, 14-dim joint positions at 25-30Hz): 50 collaborative bimanual tasks, 731 distractor objects, 21 sub-dimensions across 7 perturbation types.

RoboTwin 2.0-Plus results (success rate %):

| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0.5$ | 78.4 | 45.6 | 27.6 | 74.4 | 49.6 | 71.7 | 64.9 | 56.8 | 58.6 |
| X-VLA | 65.6 | 23.2 | 65.2 | 64.4 | 63.1 | 58.6 | 49.7 | 34.8 | 53.1 |
| MOTUS | 87.0 | 21.6 | 85.0 | 83.2 | 84.6 | 84.4 | 43.1 | 82.8 | 71.5 |
| LingBot-VA | 92.1 | 28.9 | 36.2 | 87.3 | 89.0 | 91.3 | 80.9 | 87.9 | 74.2 |

LingBot-VA dominates on 6/8 dimensions but collapses on camera perturbation (28.9) and robot initial state (36.2). $\pi_0.5$ best on camera perturbation but worst on robot initial state (27.6).

inference speed (same device):

| model | chunk size | time (ms) | relative |
|---|---|---|---|
| $\pi_0.5$ | 50 | 63 | 1.0$\times$ |
| X-VLA | 30 | 195 | 3.1$\times$ |
| GE-Act | 36 | 300 | 4.8$\times$ |
| Cosmos-Policy | 16 | 390 | 6.2$\times$ |
| LingBot-VA (RW) | 32 | 480 | 7.6$\times$ |
| MOTUS | 16 | 1175 | 18.6$\times$ |
| LingBot-VA (RT) | 32 | 5230 | 83.0$\times$ |

WAMs are 4.8-83$\times$ slower than $\pi_0.5$. the bottleneck is visual state denoising steps.
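the relative column follows directly from the chunk times, and amortizing each chunk's inference time over its length gives a per-action latency. a quick sketch with the values copied from the table above:

```python
# model: (chunk size, per-chunk inference time in ms), from the table above
timings = {
    "pi0.5":           (50, 63),
    "X-VLA":           (30, 195),
    "GE-Act":          (36, 300),
    "Cosmos-Policy":   (16, 390),
    "LingBot-VA (RW)": (32, 480),
    "MOTUS":           (16, 1175),
    "LingBot-VA (RT)": (32, 5230),
}

base = timings["pi0.5"][1]
for name, (chunk, ms) in timings.items():
    # relative slowdown vs pi0.5, and chunk time amortized per action
    print(f"{name}: {ms / base:.1f}x slower, {ms / chunk:.1f} ms per action")
```

the per-action view is even less flattering to the WAMs: $\pi_0.5$ spends about 1.3 ms per action, while Cosmos-Policy's smaller chunk (16) means its 390 ms chunk amortizes to roughly 24 ms per action.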

reproduction guide

no new model to reproduce. the paper introduces RoboTwin 2.0-Plus benchmark.

to reproduce evaluations: install SAPIEN (ManiSkill3) for RoboTwin 2.0-Plus and MuJoCo (robosuite) for LIBERO-Plus. use released checkpoints: X-VLA, MOTUS, LingBot-VA for RoboTwin; $\pi_0.5$ via openpi (JAX), $\pi_0$, OpenVLA-OFT, VLA-JEPA, GE-Act, Cosmos-Policy for LIBERO. run 8 configs per task (1 clean + 7 perturbation), 50 episodes each.
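a skeleton of that protocol, with `run_episode`, the config names, and the aggregation all as my own placeholders; note the paper's Total may weight perturbation dimensions by their sub-perturbation counts rather than uniformly, as done here:

```python
import random

PERTURBATIONS = ["camera", "robot", "language", "light", "background", "noise", "layout"]
CONFIGS = ["original"] + PERTURBATIONS  # 1 clean + 7 perturbation configs
EPISODES = 50

def run_episode(policy, task, config, seed):
    # placeholder for a real robosuite / ManiSkill3 rollout; ignores `policy`
    # and returns True on (simulated) task success
    random.seed(hash((task, config, seed)))
    return random.random() < 0.8

def evaluate(policy, tasks):
    """Success rate (%) per config, plus a uniform average over perturbations."""
    rates = {}
    for cfg in CONFIGS:
        succ = sum(run_episode(policy, t, cfg, ep)
                   for t in tasks for ep in range(EPISODES))
        rates[cfg] = 100.0 * succ / (len(tasks) * EPISODES)
    rates["total"] = sum(rates[p] for p in PERTURBATIONS) / len(PERTURBATIONS)
    return rates
```

swapping `run_episode` for an actual simulator rollout and iterating over the released checkpoints reproduces the shape of the result tables above.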

known gotchas: performance gap between JAX and PyTorch implementations of $\pi$-series models. DreamZero, GigaWorld-Policy, and Fast-WAM checkpoints not publicly available. C2 (spherical camera position) disabled by default to avoid instability.

notes

the most striking finding is that VLAs and WAMs have complementary robustness profiles. WAMs dominate on visual perturbations (noise, lighting, layout, background) – this is where spatiotemporal video priors help. but WAMs are surprisingly weak on camera viewpoint changes and robot initial state perturbations – this is where the video prior doesn’t help and may even hurt (the model expects a specific camera perspective).

$\pi_0.5$ achieves 85.7% total on LIBERO-Plus by being strong across the board rather than excelling in one dimension. this is attributed to its diverse training data (web VQA/captioning + 400h mobile manipulation + cross-embodiment data). the implication: training data diversity matters as much as architecture for robustness.

the speed gap is brutal. LingBot-VA on RoboTwin is 83$\times$ slower than $\pi_0.5$ (5230ms vs 63ms). even the fastest WAM (GE-Act at 300ms) is 4.8$\times$ slower. for real-time robot control at 10-50Hz, this latency is a deployment blocker. Cosmos-Policy reduces chunk size to 16 (vs 50 for $\pi_0.5$) to partially compensate but is still 6.2$\times$ slower.

the hybrid models (VLA-JEPA, MOTUS) sit between pure VLAs and WAMs – they get some robustness benefit from future state prediction without the full latency cost. VLA-JEPA at 77.9% on LIBERO-Plus outperforms all pure VLAs except $\pi_0.5$ and ABot-M0, suggesting that adding JEPA-style prediction heads to VLA backbones is a practical middle ground.

camera perturbation is the hardest dimension for every model. even $\pi_0.5$ drops from 96.9% to 75.4%. this is an open problem – neither language understanding (VLA) nor video priors (WAM) helps when the camera viewpoint shifts.