2026-04-01
Do World Action Models Generalize Better than VLAs? A Robustness Study
Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang et al.
problem
robot policies must handle real-world diversity and uncertainty. Vision-Language-Action (VLA) models map current state directly to action: $p_\theta(a_t \mid h_t)$. World Action Models (WAMs) predict both future state and action: either jointly $p_\phi(h_{t+1}, a_t \mid h_t)$ or in an inverse dynamics style $p_\phi(h_{t+1} \mid h_t) \cdot g_\psi(a_t \mid h_t, h_{t+1})$. WAMs claim better generalization via spatiotemporal priors from web-scale video pre-training, but this had never been systematically tested under controlled perturbations.
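the structural difference between the two factorizations can be sketched with toy placeholder samplers (everything here is hypothetical stand-in code, not any real model's API; the point is only where the extra future-state sampling step sits):

```python
import random

random.seed(0)

def vla_policy(h_t):
    """VLA factorization: sample a_t directly from p_theta(a_t | h_t).
    Placeholder: a 7-dim action conditioned only on the current state."""
    mean = sum(h_t) / len(h_t)
    return [random.gauss(mean, 1.0) for _ in range(7)]

def wam_policy(h_t):
    """WAM (inverse-dynamics factorization):
    1) sample h_{t+1} from p_phi(h_{t+1} | h_t)   <- the costly denoising step
    2) sample a_t   from g_psi(a_t | h_t, h_{t+1})"""
    h_next = [x + random.gauss(0.0, 0.1) for x in h_t]  # stand-in for video denoising
    delta = sum(hn - h for hn, h in zip(h_next, h_t)) / len(h_t)
    return [random.gauss(delta, 1.0) for _ in range(7)]

h_t = [random.gauss(0.0, 1.0) for _ in range(64)]  # toy latent state
assert len(vla_policy(h_t)) == 7
assert len(wam_policy(h_t)) == 7
```

the extra `h_next` sampling step is exactly where the latency gap (see the inference speed table) comes from.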
VLAs like $\pi_0.5$ and OpenVLA are fine-tuned from VLM backbones pre-trained on static image-text data (next-token prediction), which gives them strong language understanding but limited capability for fine-grained dynamics prediction. WAMs like Cosmos-Policy and GE-Act use video diffusion backbones pre-trained on web-scale video for future state prediction $p_\phi(h_{t+1} \mid h_t)$, giving them spatiotemporal priors that might help with visual perturbations. but WAMs are slower (inference requires denoising the visual state) and the two families had never been compared apples-to-apples.
hybrid models (VLA-JEPA, MOTUS) try to combine both: VLA-JEPA uses a Qwen3-VL-2B backbone with JEPA-style future state prediction, and MOTUS uses a Mixture-of-Transformers with Wan2.2-5B for video and a separate VLM for action.
this paper introduces two robustness benchmarks (LIBERO-Plus and RoboTwin 2.0-Plus) with 7 perturbation dimensions each, and systematically compares VLAs, WAMs, and hybrid models.
architecture
this is a comparative study, not a new method. the paper evaluates existing models:
WAMs evaluated:
- Cosmos-Policy (2B params): unified transformer for video and action from Cosmos-Predict2-2B. joint denoising of state (5 steps) and action (5 steps). pretrain-free. action as latent frame + native 7-dim, chunk size 16.
- GE-Act (2.2B params): Mixture-of-Transformers (MOT) with LTX-Video-2B for video and separate action stream with cross-modal attention. 1 state denoising step, 10 action steps. auto-regressive generation.
- LingBot-VA (5.3B params): Wan2.2-5B backbone, interleaved future visual state + action prediction. causal IDM-style: action conditioned on predicted visual state. real-world: 3 state + 5 action denoising steps. 30-dim action (absolute EEF quaternion + joints), chunk size 4.
- DreamZero (14B params): Wan2.1-14B backbone. excluded from benchmarks due to proprietary data and 15+ minute warm-up time.
- GigaWorld-Policy (>5B): Wan2.2-5B backbone. predicts action first, then future visual state. omits state generation at test time. latency 360ms. no public checkpoints.
- Fast-WAM (6B): Wan2.2-5B backbone. MOT architecture. pretrain-free. latency 190ms. no public checkpoints.
VLAs evaluated:
- $\pi_0.5$: VLM backbone, absolute EEF/joint control, 50-step chunks, 50Hz, flow matching
- $\pi_0$: VLM backbone, delta EEF, 50-step chunks, 50Hz
- $\pi_0$-FAST: FAST tokenized delta EEF, 50-step chunks, 15-50Hz
- OpenVLA-OFT: VLM backbone, delta EEF, 8-step chunks, 3-10Hz
- X-VLA: absolute EEF with 6D rotation, 10/20-dim, 32-step chunks
- VLA-JEPA: Qwen3-VL-2B + JEPA future-state prediction, delta EEF, 7-step horizon, conditional flow-matching
- MOTUS: MOT with Wan2.2-5B for video + separate VLM for action, 14-dim latent action (optical flow), 48-step chunks, 30Hz
training
this is a comparative study. training details are from the evaluated models:
$\pi_0.5$ finetuning on RoboTwin: pretrained $\pi_0.5$ checkpoint, full 27.5K RoboTwin 2.0 training trajectories, 60K gradient steps, AdamW ($\beta_1=0.9$, $\beta_2=0.95$), gradient clipping 1.0, cosine LR decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$, batch size 64, delta joint actions, openpi (JAX).
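the cosine decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$ over 60K steps is a standard schedule; a minimal sketch of what it implies per step (the actual openpi implementation may differ, e.g. it may add warmup):

```python
import math

LR_MAX, LR_MIN, TOTAL_STEPS = 2.5e-5, 2.5e-6, 60_000

def cosine_lr(step):
    """Cosine decay: LR_MAX at step 0, LR_MIN at TOTAL_STEPS, half-cosine in between."""
    t = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * t))

assert abs(cosine_lr(0) - LR_MAX) < 1e-12        # starts at 2.5e-5
assert abs(cosine_lr(60_000) - LR_MIN) < 1e-12   # ends at 2.5e-6
assert cosine_lr(30_000) < cosine_lr(10_000)     # monotonically decreasing
```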
training data summary by model:
| model | web video | robot data | task-specific | pretrain-free |
|---|---|---|---|---|
| $\pi_0$ | – | >10k h cross-embodiment | 5-100 h | no |
| $\pi_0.5$ | web VQA/captioning | 400h mobile manip | 1-20 h | no |
| OpenVLA-OFT | – | 970k cross-embodiment | 20-300 demos | no |
| X-VLA | – | 288k cross-embodiment | 50 demos | no |
| Cosmos-Policy | – | – | 185 demos | yes |
| GE-Act | – | 3k h single-embodiment | 1 h | no |
| LingBot-VA | – | 16k h cross-embodiment | 50 demos | no |
| VLA-JEPA | 220k human ego | 76k single-embodiment | 100 demos | no |
| MOTUS | 231k human ego | 781k cross-embodiment | 100 demos + 1k task-agnostic | no |
evaluation
LIBERO-Plus (MuJoCo/robosuite, Franka Panda 7-DoF, 256 $\times$ 256, 7-dim delta EEF at 10Hz): 40 base tasks, 416 distractor objects, 7 perturbation dimensions. $\pi_0.5$ finetuned with 60K gradient steps.
LIBERO-Plus results (success rate %):
| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0$ | 94.2 | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| $\pi_0$ (rerun) | 91.3 | 61.0 | 40.8 | 63.5 | 89.3 | 84.1 | 80.1 | 76.4 | 69.4 |
| $\pi_0.5$ | 96.9 | 75.4 | 77.5 | 85.6 | 96.9 | 94.6 | 89.7 | 85.7 | 85.7 |
| OpenVLA-OFT | 97.6 | 55.6 | 21.7 | 81.0 | 92.7 | 91.0 | 78.6 | 68.7 | 67.9 |
| X-VLA | 98.1 | 23.4 | 89.7 | 75.7 | 88.2 | 96.0 | 62.7 | 71.8 | 71.4 |
| RIPT-VLA | 97.5 | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 68.4 |
| ABot-M0 | 98.6 | 60.4 | 67.9 | 86.4 | 96.2 | 91.6 | 86.4 | 82.6 | 80.5 |
| VLA-JEPA | 97.2 | 64.2 | 67.7 | 88.1 | 91.8 | 93.4 | 65.8 | 83.9 | 77.9 |
| GE-Act | 94.4 | 60.7 | 77.0 | 77.4 | 95.8 | 86.0 | 90.9 | 80.2 | 80.3 |
| Cosmos-Policy | 98.5 | 75.8 | 63.3 | 81.7 | 96.5 | 88.9 | 92.7 | 82.2 | 82.2 |
overall ranking: $\pi_0.5$ (85.7) > Cosmos-Policy (82.2) > ABot-M0 (80.5) > GE-Act (80.3) > VLA-JEPA (77.9) > X-VLA (71.4).
RoboTwin 2.0-Plus (SAPIEN/ManiSkill3, Aloha-AgileX 14-DoF bimanual, 320 $\times$ 240, 14-dim joint positions at 25-30Hz): 50 collaborative bimanual tasks, 731 distractor objects, 21 sub-dimensions across 7 perturbation types.
RoboTwin 2.0-Plus results (success rate %):
| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0.5$ | 78.4 | 45.6 | 27.6 | 74.4 | 49.6 | 71.7 | 64.9 | 56.8 | 58.6 |
| X-VLA | 65.6 | 23.2 | 65.2 | 64.4 | 63.1 | 58.6 | 49.7 | 34.8 | 53.1 |
| MOTUS | 87.0 | 21.6 | 85.0 | 83.2 | 84.6 | 84.4 | 43.1 | 82.8 | 71.5 |
| LingBot-VA | 92.1 | 28.9 | 36.2 | 87.3 | 89.0 | 91.3 | 80.9 | 87.9 | 74.2 |
LingBot-VA dominates on 6/8 dimensions but collapses on camera perturbation (28.9) and robot initial state (36.2). $\pi_0.5$ best on camera perturbation but worst on robot initial state (27.6).
inference speed (same device):
| model | chunk size | latency (ms) | relative to $\pi_0.5$ |
|---|---|---|---|
| $\pi_0.5$ | 50 | 63 | 1.0$\times$ |
| X-VLA | 30 | 195 | 3.1$\times$ |
| GE-Act | 36 | 300 | 4.8$\times$ |
| Cosmos-Policy | 16 | 390 | 6.2$\times$ |
| LingBot-VA (RW) | 32 | 480 | 7.6$\times$ |
| MOTUS | 16 | 1175 | 18.6$\times$ |
| LingBot-VA (RT) | 32 | 5230 | 83.0$\times$ |
WAMs are 4.8-83$\times$ slower than $\pi_0.5$. the bottleneck is visual state denoising steps.
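whether a given latency is a deployment blocker follows from simple arithmetic: a chunk of $n$ actions executed at $f$ Hz covers $n/f$ seconds of control, and the next chunk must be ready before that window runs out. a sketch using the table's numbers (control rates are my assumption from the benchmark descriptions: 50Hz for $\pi_0.5$, 25Hz for LingBot-VA on RoboTwin, 10Hz for Cosmos-Policy on LIBERO):

```python
def realtime_feasible(chunk_size, latency_ms, control_hz):
    """A chunk of actions covers chunk_size / control_hz seconds of execution;
    inference for the next chunk must finish within that window."""
    window_ms = 1000.0 * chunk_size / control_hz
    return latency_ms <= window_ms

# pi0.5: 50-action chunk at 50Hz -> 1000ms window vs 63ms inference
assert realtime_feasible(50, 63, 50)

# LingBot-VA on RoboTwin: 32-action chunk at 25Hz -> 1280ms window vs 5230ms inference
assert not realtime_feasible(32, 5230, 25)

# Cosmos-Policy on LIBERO: 16-action chunk at 10Hz -> 1600ms window vs 390ms inference
assert realtime_feasible(16, 390, 10)
```

this also shows why shrinking the chunk (Cosmos-Policy's 16 vs $\pi_0.5$'s 50) only partially compensates: it shrinks the window at the same time.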
reproduction guide
no new model to reproduce. the paper introduces the RoboTwin 2.0-Plus benchmark.
to reproduce evaluations: install SAPIEN (ManiSkill3) for RoboTwin 2.0-Plus and MuJoCo (robosuite) for LIBERO-Plus. use released checkpoints: X-VLA, MOTUS, LingBot-VA for RoboTwin; $\pi_0.5$ via openpi (JAX), $\pi_0$, OpenVLA-OFT, VLA-JEPA, GE-Act, Cosmos-Policy for LIBERO. run 8 configs per task (1 clean + 7 perturbation), 50 episodes each.
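the protocol above reduces to a nested loop; a sketch with a stub rollout function (names and signatures hypothetical, not any benchmark's actual API):

```python
def evaluate(policy, run_episode, tasks, perturbations, episodes=50):
    """Run 1 clean + len(perturbations) perturbed configs over all tasks,
    `episodes` rollouts each; return mean success rate (%) per config."""
    results = {}
    for config in ["clean"] + list(perturbations):
        successes = total = 0
        for task in tasks:
            for seed in range(episodes):
                successes += run_episode(policy, task, config, seed)  # 1 success, 0 failure
                total += 1
        results[config] = 100.0 * successes / total
    return results

# stub rollout: "succeeds" on everything except camera perturbation
def run_episode(policy, task, config, seed):
    return 1 if config != "camera" else 0

out = evaluate(None, run_episode, tasks=range(50), perturbations=["camera", "lighting"])
assert out["clean"] == 100.0 and out["camera"] == 0.0
```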
known gotchas: performance gap between JAX and PyTorch implementations of $\pi$-series models. DreamZero, GigaWorld-Policy, and Fast-WAM checkpoints not publicly available. C2 (spherical camera position) disabled by default to avoid instability.
notes
the most striking finding is that VLAs and WAMs have complementary robustness profiles. WAMs dominate on visual perturbations (noise, lighting, layout, background) – this is where spatiotemporal video priors help. but WAMs are surprisingly weak on camera viewpoint changes and robot initial state perturbations – this is where the video prior doesn’t help and may even hurt (the model expects a specific camera perspective).
$\pi_0.5$ achieves 85.7% total on LIBERO-Plus by being strong across the board rather than excelling in one dimension. this is attributed to its diverse training data (web VQA/captioning + 400h mobile manipulation + cross-embodiment data). the implication: training data diversity matters as much as architecture for robustness.
the speed gap is brutal. LingBot-VA on RoboTwin is 83$\times$ slower than $\pi_0.5$ (5230ms vs 63ms). even the fastest WAM (GE-Act at 300ms) is 4.8$\times$ slower. for real-time robot control at 10-50Hz, this latency is a deployment blocker. Cosmos-Policy reduces chunk size to 16 (vs 50 for $\pi_0.5$) to partially compensate but is still 6.2$\times$ slower.
the hybrid models (VLA-JEPA, MOTUS) sit between pure VLAs and WAMs – they get some robustness benefit from future state prediction without the full latency cost. VLA-JEPA at 77.9% on LIBERO-Plus outperforms all pure VLAs except $\pi_0.5$ and ABot-M0, suggesting that adding JEPA-style prediction heads to VLA backbones is a practical middle ground.
camera perturbation is the hardest dimension for every model. even $\pi_0.5$ drops from 96.9% to 75.4%. this is an open problem – neither language understanding (VLA) nor video priors (WAM) helps when the camera viewpoint shifts.