2026-04-01

Do World Action Models Generalize Better than VLAs? A Robustness Study

Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang et al.

world-models VLA robotics robustness

problem

robot policies must handle real-world diversity and uncertainty. Vision-Language-Action (VLA) models map current state directly to action: $p_\theta(a_t \mid h_t)$. World Action Models (WAMs) predict both future state and action: either jointly $p_\phi(h_{t+1}, a_t \mid h_t)$ or in an inverse dynamics style $p_\phi(h_{t+1} \mid h_t) \cdot g_\psi(a_t \mid h_t, h_{t+1})$. WAMs claim better generalization via spatiotemporal priors from web-scale video pre-training, but this had never been systematically tested under controlled perturbations.

VLAs like $\pi_0.5$ and OpenVLA are fine-tuned from VLM backbones pre-trained on static image-text data (next-token prediction), which gives them strong language understanding but limited fine-grained dynamic prediction capability. WAMs like Cosmos-Policy and GE-Act use video diffusion backbones pre-trained on web-scale video for future state prediction $p_\phi(h_{t+1} \mid h_t)$, giving them spatiotemporal priors that might help with visual perturbations. but WAMs are slower (they must iteratively denoise the visual state at inference time) and had not been compared apples-to-apples before this study.
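the contrast between the two factorizations can be sketched with toy stand-ins. `vla_policy` and `wam_policy` are hypothetical names of mine, and random projections stand in for the learned networks; this only illustrates the interfaces, not any model in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def vla_policy(h_t):
    """VLA: map the current state directly to an action, p(a_t | h_t).
    A random projection stands in for the learned network."""
    W = rng.standard_normal((7, h_t.size))  # 7-dim action head
    return W @ h_t

def wam_policy(h_t):
    """WAM, inverse-dynamics factorization: first predict the future
    state p(h_{t+1} | h_t), then infer the action g(a_t | h_t, h_{t+1})."""
    h_next = h_t + 0.1 * rng.standard_normal(h_t.shape)  # world-model rollout (stub)
    W = rng.standard_normal((7, 2 * h_t.size))
    return W @ np.concatenate([h_t, h_next])

h = rng.standard_normal(64)  # toy latent observation
assert vla_policy(h).shape == wam_policy(h).shape == (7,)
```

the structural point is that the WAM pays for an extra forward rollout of the world model before it can emit an action, which is where the latency gap discussed later comes from.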

hybrid models (VLA-JEPA, MOTUS) try to combine both: VLA-JEPA uses a Qwen3-VL-2B backbone with JEPA-style future state prediction, and MOTUS uses a Mixture-of-Transformers with Wan2.2-5B for video and a separate VLM for action.

this paper introduces two robustness benchmarks (LIBERO-Plus and RoboTwin 2.0-Plus) with 7 perturbation dimensions each, and systematically compares VLAs, WAMs, and hybrid models.

architecture

this is a comparative study, not a new method. the paper evaluates existing models:

WAMs evaluated:

  • Cosmos-Policy (2B params): unified transformer for video and action from Cosmos-Predict2-2B. joint denoising of state (5 steps) and action (5 steps). pretrain-free. action as latent frame + native 7-dim, chunk size 16.
  • GE-Act (2.2B params): Mixture-of-Transformers (MOT) with LTX-Video-2B for video and separate action stream with cross-modal attention. 1 state denoising step, 10 action steps. auto-regressive generation.
  • LingBot-VA (5.3B params): Wan2.2-5B backbone, interleaved future visual state + action prediction. causal IDM-style: action conditioned on predicted visual state. real-world: 3 state + 5 action denoising steps. 30-dim action (absolute EEF quaternion + joints), chunk size 4.
  • DreamZero (14B params): Wan2.1-14B backbone. excluded from benchmarks due to proprietary data and 15+ minute warm-up time.
  • GigaWorld-Policy (>5B): Wan2.2-5B backbone. predicts action first, then future visual state. omits state generation at test time. latency 360ms. no public checkpoints.
  • Fast-WAM (6B): Wan2.2-5B backbone. MOT architecture. pretrain-free. latency 190ms. no public checkpoints.

VLAs evaluated:

  • $\pi_0.5$: VLM backbone, absolute EEF/joint control, 50-step chunks, 50Hz, flow matching
  • $\pi_0$: VLM backbone, delta EEF, 50-step chunks, 50Hz
  • $\pi_0$-FAST: FAST tokenized delta EEF, 50-step chunks, 15-50Hz
  • OpenVLA-OFT: VLM backbone, delta EEF, 8-step chunks, 3-10Hz
  • X-VLA: absolute EEF with 6D rotation, 10/20-dim, 32-step chunks
  • VLA-JEPA: Qwen3-VL-2B + JEPA future-state prediction, delta EEF, 7-step horizon, conditional flow-matching
  • MOTUS: MOT with Wan2.2-5B for video + separate VLM for action, 14-dim latent action (optical flow), 48-step chunks, 30Hz

training

this is a comparative study. training details are from the evaluated models:

$\pi_0.5$ finetuning on RoboTwin: pretrained $\pi_0.5$ checkpoint, full 27.5K RoboTwin 2.0 training trajectories, 60K gradient steps, AdamW ($\beta_1=0.9$, $\beta_2=0.95$), gradient clipping 1.0, cosine LR decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$, batch size 64, delta joint actions, openpi (JAX).
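the cosine schedule above is easy to restate concretely. a minimal sketch of the decay from $2.5 \times 10^{-5}$ to $2.5 \times 10^{-6}$ over 60K steps, assuming no warmup (the actual openpi config may differ):

```python
import math

def cosine_lr(step, total_steps=60_000, lr_max=2.5e-5, lr_min=2.5e-6):
    """Cosine decay from lr_max to lr_min over total_steps."""
    t = min(step, total_steps) / total_steps  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

assert abs(cosine_lr(0) - 2.5e-5) < 1e-12        # starts at lr_max
assert abs(cosine_lr(60_000) - 2.5e-6) < 1e-12   # ends at lr_min
```

halfway through (step 30K) the rate sits at the midpoint of the two bounds, $1.375 \times 10^{-5}$, as the cosine term crosses zero.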

training data summary by model:

| model | web video | robot data | task-specific | pretrain-free |
|---|---|---|---|---|
| $\pi_0$ | — | >10k h cross-embodiment | 5-100 h | no |
| $\pi_0.5$ | web VQA/captioning | 400h mobile manip | 1-20 h | no |
| OpenVLA-OFT | — | 970k cross-embodiment | 20-300 demos | no |
| X-VLA | — | 288k cross-embodiment | 50 demos | no |
| Cosmos-Policy | — | — | 185 demos | yes |
| GE-Act | — | 3k h single-embodiment | 1 h | no |
| LingBot-VA | — | 16k h cross-embodiment | 50 demos | no |
| VLA-JEPA | 220k human ego | 76k single-embodiment | 100 demos | no |
| MOTUS | 231k human ego | 781k cross-embodiment | 100 demos + 1k task-agnostic | no |

evaluation

LIBERO-Plus (MuJoCo/robosuite, Franka Panda 7-DoF, 256 $\times$ 256, 7-dim delta EEF at 10Hz): 40 base tasks, 416 distractor objects, 7 perturbation dimensions. $\pi_0.5$ finetuned with 60K gradient steps.

LIBERO-Plus results (success rate %):

| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0$ | 94.2 | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| $\pi_0$ (rerun) | 91.3 | 61.0 | 40.8 | 63.5 | 89.3 | 84.1 | 80.1 | 76.4 | 69.4 |
| $\pi_0.5$ | 96.9 | 75.4 | 77.5 | 85.6 | 96.9 | 94.6 | 89.7 | 85.7 | 85.7 |
| OpenVLA-OFT | 97.6 | 55.6 | 21.7 | 81.0 | 92.7 | 91.0 | 78.6 | 68.7 | 67.9 |
| X-VLA | 98.1 | 23.4 | 89.7 | 75.7 | 88.2 | 96.0 | 62.7 | 71.8 | 71.4 |
| RIPT-VLA | 97.5 | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 68.4 |
| ABot-M0 | 98.6 | 60.4 | 67.9 | 86.4 | 96.2 | 91.6 | 86.4 | 82.6 | 80.5 |
| VLA-JEPA | 97.2 | 64.2 | 67.7 | 88.1 | 91.8 | 93.4 | 65.8 | 83.9 | 77.9 |
| GE-Act | 94.4 | 60.7 | 77.0 | 77.4 | 95.8 | 86.0 | 90.9 | 80.2 | 80.3 |
| Cosmos-Policy | 98.5 | 75.8 | 63.3 | 81.7 | 96.5 | 88.9 | 92.7 | 82.2 | 82.2 |

overall ranking: $\pi_0.5$ (85.7) > Cosmos-Policy (82.2) > ABot-M0 (80.5) > GE-Act (80.3) > VLA-JEPA (77.9) > X-VLA (71.4).

RoboTwin 2.0-Plus (SAPIEN/ManiSkill3, Aloha-AgileX 14-DoF bimanual, 320 $\times$ 240, 14-dim joint positions at 25-30Hz): 50 collaborative bimanual tasks, 731 distractor objects, 21 sub-dimensions across 7 perturbation types.

RoboTwin 2.0-Plus results (success rate %):

| model | Original | Camera | Robot | Lang | Light | BG | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|---|
| $\pi_0.5$ | 78.4 | 45.6 | 27.6 | 74.4 | 49.6 | 71.7 | 64.9 | 56.8 | 58.6 |
| X-VLA | 65.6 | 23.2 | 65.2 | 64.4 | 63.1 | 58.6 | 49.7 | 34.8 | 53.1 |
| MOTUS | 87.0 | 21.6 | 85.0 | 83.2 | 84.6 | 84.4 | 43.1 | 82.8 | 71.5 |
| LingBot-VA | 92.1 | 28.9 | 36.2 | 87.3 | 89.0 | 91.3 | 80.9 | 87.9 | 74.2 |

LingBot-VA dominates on 6/8 dimensions but collapses on camera perturbation (28.9) and robot initial state (36.2). $\pi_0.5$ best on camera perturbation but worst on robot initial state (27.6).

inference speed (same device):

| model | chunk size | time (ms) | relative |
|---|---|---|---|
| $\pi_0.5$ | 50 | 63 | 1.0$\times$ |
| X-VLA | 30 | 195 | 3.1$\times$ |
| GE-Act | 36 | 300 | 4.8$\times$ |
| Cosmos-Policy | 16 | 390 | 6.2$\times$ |
| LingBot-VA (RW) | 32 | 480 | 7.6$\times$ |
| MOTUS | 16 | 1175 | 18.6$\times$ |
| LingBot-VA (RT) | 32 | 5230 | 83.0$\times$ |

WAMs are 4.8-83$\times$ slower than $\pi_0.5$. the bottleneck is visual state denoising steps.
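the relative column follows directly from the chunk times, and amortizing each chunk's inference time over its length gives a per-action latency. a quick sketch with the values copied from the table above:

```python
# model: (chunk size, per-chunk inference time in ms), from the table above
timings = {
    "pi0.5":           (50, 63),
    "X-VLA":           (30, 195),
    "GE-Act":          (36, 300),
    "Cosmos-Policy":   (16, 390),
    "LingBot-VA (RW)": (32, 480),
    "MOTUS":           (16, 1175),
    "LingBot-VA (RT)": (32, 5230),
}

base = timings["pi0.5"][1]
for name, (chunk, ms) in timings.items():
    # relative slowdown vs pi0.5, and chunk time amortized per action
    print(f"{name}: {ms / base:.1f}x slower, {ms / chunk:.1f} ms per action")
```

the per-action view is even less flattering to the WAMs: $\pi_0.5$ spends about 1.3 ms per action, while Cosmos-Policy's smaller chunk (16) means its 390 ms chunk amortizes to roughly 24 ms per action.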

reproduction guide

no new model to reproduce. the paper introduces RoboTwin 2.0-Plus benchmark.

to reproduce evaluations: install SAPIEN (ManiSkill3) for RoboTwin 2.0-Plus and MuJoCo (robosuite) for LIBERO-Plus. use released checkpoints: X-VLA, MOTUS, LingBot-VA for RoboTwin; $\pi_0.5$ via openpi (JAX), $\pi_0$, OpenVLA-OFT, VLA-JEPA, GE-Act, Cosmos-Policy for LIBERO. run 8 configs per task (1 clean + 7 perturbation), 50 episodes each.
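a skeleton of that protocol, with `run_episode`, the config names, and the aggregation all as my own placeholders; note the paper's Total may weight perturbation dimensions by their sub-perturbation counts rather than uniformly, as done here:

```python
import random

PERTURBATIONS = ["camera", "robot", "language", "light", "background", "noise", "layout"]
CONFIGS = ["original"] + PERTURBATIONS  # 1 clean + 7 perturbation configs
EPISODES = 50

def run_episode(policy, task, config, seed):
    # placeholder for a real robosuite / ManiSkill3 rollout; ignores `policy`
    # and returns True on (simulated) task success
    random.seed(hash((task, config, seed)))
    return random.random() < 0.8

def evaluate(policy, tasks):
    """Success rate (%) per config, plus a uniform average over perturbations."""
    rates = {}
    for cfg in CONFIGS:
        succ = sum(run_episode(policy, t, cfg, ep)
                   for t in tasks for ep in range(EPISODES))
        rates[cfg] = 100.0 * succ / (len(tasks) * EPISODES)
    rates["total"] = sum(rates[p] for p in PERTURBATIONS) / len(PERTURBATIONS)
    return rates
```

swapping `run_episode` for an actual simulator rollout and iterating over the released checkpoints reproduces the shape of the result tables above.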

known gotchas: performance gap between JAX and PyTorch implementations of $\pi$-series models. DreamZero, GigaWorld-Policy, and Fast-WAM checkpoints not publicly available. C2 (spherical camera position) disabled by default to avoid instability.

notes

the most striking finding is that VLAs and WAMs have complementary robustness profiles. WAMs dominate on visual perturbations (noise, lighting, layout, background) – this is where spatiotemporal video priors help. but WAMs are surprisingly weak on camera viewpoint changes and robot initial state perturbations – this is where the video prior doesn’t help and may even hurt (the model expects a specific camera perspective).

$\pi_0.5$ achieves 85.7% total on LIBERO-Plus by being strong across the board rather than excelling in one dimension. this is attributed to its diverse training data (web VQA/captioning + 400h mobile manipulation + cross-embodiment data). the implication: training data diversity matters as much as architecture for robustness.

the speed gap is brutal. LingBot-VA on RoboTwin is 83$\times$ slower than $\pi_0.5$ (5230ms vs 63ms). even the fastest WAM (GE-Act at 300ms) is 4.8$\times$ slower. for real-time robot control at 10-50Hz, this latency is a deployment blocker. Cosmos-Policy reduces chunk size to 16 (vs 50 for $\pi_0.5$) to partially compensate but is still 6.2$\times$ slower.

the hybrid models (VLA-JEPA, MOTUS) sit between pure VLAs and WAMs – they get some robustness benefit from future state prediction without the full latency cost. VLA-JEPA at 77.9% on LIBERO-Plus outperforms all pure VLAs except $\pi_0.5$ and ABot-M0, suggesting that adding JEPA-style prediction heads to VLA backbones is a practical middle ground.

camera perturbation is the hardest dimension for every model. even $\pi_0.5$ drops from 96.9% to 75.4%. this is an open problem – neither language understanding (VLA) nor video priors (WAM) helps when the camera viewpoint shifts.