2026-04-01
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu
problem
real-world RL fine-tuning of VLA models achieves high success rates (>90%) but only in narrowly scoped settings. scaling scene/object diversity in the physical world is prohibitively expensive. this paradoxically transforms broadly pretrained VLAs into overfitted, scene-specific policies. existing sim-to-real RL approaches (SimpleVLA-RL, ReBot, VLA-RFT) train in only 3 hand-designed scenes, severely limiting generalization.
the gap: nobody has shown how to scale scene distribution for sim-to-real RL beyond a handful of manually designed environments. generative 3D models (TRELLIS, WorldGen, Holodeck) create static assets or procedural environments with limited physical interactivity. EmbodiedGen does language-driven interactive 3D scene generation but hasn’t been integrated with VLA RL training at scale.
architecture
base model: $\pi_0$
built on the $\pi_0$ architecture (~3B total params): Gemma 2B VLM backbone + SigLip 400M vision encoder + 300M flow-matching action expert head. pretrained on BridgeV2 with rectified flow-matching:
\[\mathcal{L}_{\text{flow}}(\theta) = \mathbb{E}\left[\|v_\theta(A^t_\tau, KV_\theta(o_t), \tau) - (A^t_1 - \varepsilon)\|^2\right]\]
ODE integration from $\tau=0$ to $\tau=1$ with $K=10$ steps. action chunk size $C=4$, actions are end-effector delta-pose + binary gripper $\in \mathbb{R}^{C \times 7}$.
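a minimal sketch of the rectified flow-matching loss and Euler ODE sampling, with a toy MLP standing in for the 300M action expert (all sizes except $C=4$, $D=7$ are illustrative):

```python
import torch

torch.manual_seed(0)
C, D = 4, 7                       # chunk size, action dim (delta-pose + gripper)

# toy velocity network standing in for the conditioned action expert
v_theta = torch.nn.Sequential(torch.nn.Linear(C * D + 1, 64),
                              torch.nn.ReLU(),
                              torch.nn.Linear(64, C * D))

def flow_loss(a1, eps, tau):
    """Rectified-flow loss: regress the straight-line velocity (a1 - eps)."""
    t = tau.view(-1, 1, 1)
    a_tau = (1 - t) * eps + t * a1                # point on the noise->data line
    inp = torch.cat([a_tau.flatten(1), tau], dim=1)
    return ((v_theta(inp) - (a1 - eps).flatten(1)) ** 2).mean()

def sample_actions(K=10):
    """Euler ODE integration from tau=0 (noise) to tau=1 in K steps."""
    a = torch.randn(1, C, D)
    for k in range(K):
        tau = torch.full((1, 1), k / K)
        a = a + v_theta(torch.cat([a.flatten(1), tau], 1)).view(1, C, D) / K
    return a

a1, eps = torch.randn(2, C, D), torch.randn(2, C, D)
loss = flow_loss(a1, eps, torch.rand(2, 1))
act = sample_actions(K=1)         # K=1 single-step variant used during RL
print(loss.shape, act.shape)
```

with $K=1$ the loop collapses to a single Euler step, which is what makes the gaussian-policy conversion below cheap.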
PPOFlow (RL fine-tuning)
adds two heads to $\pi_0$:
- noise head $\sigma_\phi$: shallow MLP that injects learnable gaussian noise into the flow-matching integration, converting the deterministic policy into a stochastic one. log-std clamped to $[-2.5, -2.0]$. fully fine-tuned.
- value head $V_\psi$: shallow MLP for GAE advantage computation. fully fine-tuned.
power-scaled importance ratio: $\hat{r}_t = (\pi_\theta / \pi_{\theta,\text{old}})^s$ with $s=0.2$, crucial for training stability. PPO clipped objective with $\epsilon=0.2$.
key insight: reducing $K$ from 10 to 1 during RL converts multi-step flow matching into single-step gaussian policy. no performance degradation, 2.36x inference speedup with torch.compile.
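a sketch of the two PPOFlow ingredients above: a gaussian policy built from the noise head's clamped log-std, and the power-scaled clipped surrogate (the `exp(s * Δlogp)` form is my numerically stable reading of $\hat{r}_t = r_t^s$; the toy mean/log-std values are illustrative):

```python
import torch

def ppoflow_surrogate(logp_new, logp_old, adv, s=0.2, eps_clip=0.2):
    """PPO clipped objective with a power-scaled importance ratio:
    r_hat = (pi_new / pi_old) ** s = exp(s * (logp_new - logp_old))."""
    r_hat = torch.exp(s * (logp_new - logp_old))
    unclipped = r_hat * adv
    clipped = torch.clamp(r_hat, 1 - eps_clip, 1 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()

# with K=1 the policy is a gaussian around the single Euler step:
# a = mu + sigma * z, mu from the flow step, sigma from the noise head
log_std = torch.clamp(torch.zeros(7), -2.5, -2.0)   # noise head output (toy)
mu = torch.zeros(7)                                  # single Euler step (toy)
dist = torch.distributions.Normal(mu, log_std.exp())
a = dist.sample()
logp = dist.log_prob(a).sum()

loss = ppoflow_surrogate(logp.unsqueeze(0), logp.unsqueeze(0) - 0.1,
                         adv=torch.tensor([1.0]))
print(float(loss))
```

note that at $s=0.2$ a 2x likelihood ratio shrinks to $2^{0.2} \approx 1.15$, which is why the power scaling keeps updates inside the clip region far longer than the raw ratio would.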
fine-tuning strategy
LoRA (rank 32) on the full VLM (including SigLip) + fully fine-tune action expert + value/noise heads. alternatives tested:
- (a) freeze VLM, LoRA action head: model collapse
- (b) freeze VLM, full action head: model collapse
- (c) freeze SigLip, LoRA Gemma + LoRA action head: a few pp lower
- (d) LoRA VLM + LoRA action head: worse than full fine-tune on action head
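the winning recipe (LoRA on the frozen VLM, full fine-tune elsewhere) can be sketched with a from-scratch LoRA wrapper; in practice one would use a library like peft, and the layer sizes here are toy:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen linear layer with a rank-r trainable update W + (alpha/r) * B @ A."""
    def __init__(self, base: torch.nn.Linear, r=32, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(128, 64), r=32)
y = layer(torch.randn(2, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # LoRA adds r*(in+out) = 32*(128+64) = 6144 params
```

zero-initializing `B` means the adapted layer starts exactly at the pretrained function, which matters when the RL signal is sparse.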
generative 3D world pipeline
GPT-4o converts task descriptions to structured scene graphs with semantic roles (background, context, distractors, targets, robot) and spatial relations. fed into extended EmbodiedGen to produce interactive ManiSkill 3 environments. automated QA pipeline with three GPT-4o checkers: semantic appearance (83.3% pass), mesh geometry (75.2% pass), text-to-3D alignment (91.9% pass). 85% acceptance rate. 100 generated environments with 516 unique object assets (avg 5.16 interactive objects per scene).
```mermaid
flowchart LR
    Task[task description] --> GPT4o[GPT-4o scene designer]
    GPT4o --> SceneGraph[structured scene graph]
    SceneGraph --> EmbodiedGen[EmbodiedGen engine]
    EmbodiedGen --> QA[GPT-4o QA pipeline]
    QA --> ManiSkill[ManiSkill 3 simulator]
    ManiSkill --> VLA[VLA RL training]
```
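a hypothetical scene graph in the spirit of the pipeline (field names and contents are my guesses at the structure, not the paper's actual schema), plus a minimal QA-style role check:

```python
import json

# illustrative scene graph: semantic roles + spatial relations,
# as consumed by the EmbodiedGen-based generator
scene_graph = {
    "background": "wooden kitchen table",
    "context": ["cutting board", "fruit bowl"],
    "distractors": ["mug", "spoon"],
    "targets": [{"object": "red apple", "relation": "on", "anchor": "plate"}],
    "robot": {"model": "WidowX 250S", "pose": "table_edge"},
}

def validate(graph):
    """Minimal QA-style check: all required semantic roles are present."""
    required = {"background", "context", "distractors", "targets", "robot"}
    return required <= graph.keys()

print(validate(scene_graph), len(json.dumps(scene_graph)) > 0)
```

the paper's actual QA is heavier (three GPT-4o checkers over appearance, geometry, and text-to-3D alignment); this only illustrates the structured hand-off from designer to generator.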
training
hardware: 8x NVIDIA RTX 6000 Ada (training), 1x RTX 4090 (inference)
duration: 5 days
| parameter | value |
|---|---|
| environments | 192 parallel |
| batch size | 19,200 |
| mini-batch size | 1,920 |
| episode length | 25 steps |
| discount factor $\gamma$ | 0.99 |
| learning rate | $2 \times 10^{-5}$ |
| gradient clip (global norm) | 0.5 |
| clipping ratio $\epsilon$ | 0.2 |
| importance ratio scale $s$ | 0.2 |
| LoRA rank | 32 |
| integration steps $K$ | 1 (down from 10) |
| control frequency | 5 Hz |
reward: sparse rule-based success from simulator. success = contact(A,B) AND NOT contact(A,table) AND NOT contact(A,robot), i.e. object A rests on target B without touching the table or the robot.
domain randomization: object position $\pm 0.15$m, orientation $[0, 2\pi]$, robot joint perturbation $\pm 0.1$ rad, camera position $\pm 0.05$m, ambient light RGB $[0, 0.6]^3$.
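the reward predicate and the per-episode randomization ranges above, sketched directly (contact pairs and the 6-joint arm are illustrative simplifications):

```python
import random

def success(contacts, obj="A", target="B"):
    """Sparse reward: A touches B, and A touches neither table nor robot."""
    return ((obj, target) in contacts
            and (obj, "table") not in contacts
            and (obj, "robot") not in contacts)

def randomize_episode(rng):
    """Sample per-episode domain randomization in the stated ranges."""
    return {
        "obj_xy":      [rng.uniform(-0.15, 0.15) for _ in range(2)],   # m
        "obj_yaw":      rng.uniform(0.0, 6.283),                       # rad
        "joint_noise": [rng.uniform(-0.1, 0.1) for _ in range(6)],     # rad
        "cam_offset":  [rng.uniform(-0.05, 0.05) for _ in range(3)],   # m
        "ambient_rgb": [rng.uniform(0.0, 0.6) for _ in range(3)],
    }

rng = random.Random(0)
print(success({("A", "B")}), success({("A", "B"), ("A", "table")}))
print(randomize_episode(rng)["obj_yaw"])
```

the NOT-contact terms are what turn "object near target" into a genuine place: a stack that still leans on the table or is pinned by the gripper scores zero.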
evaluation
simulation (Table 1)
| scenes $N$ | generated scenes (EG) SR (%) | manual scenes (SE) SR (%) |
|---|---|---|
| $N=0$ (pretrained) | 9.7 | 23.7 |
| $N=3$ (manually designed) | 36.0 | 96.7 |
| $N=1$ | 51.6 | 36.1 |
| $N=10$ | 72.1 | 54.3 |
| $N=25$ | 78.3 | 70.1 |
| $N=50$ | 79.2 | 68.4 |
| $N=100$ | 79.8 | 74.3 |
$N=3$ manual scenes: 96.7% on the training scenes but only 36.0% on generated scenes, severe overfitting (60.7pp gap). $N=100$ achieves 79.8% on generated AND 74.3% on the manual scenes it was never trained on. scaling scene diversity from 1 to 50 gives +27.6pp OOD improvement (51.6% → 79.2%).
sim-to-real (Table 2, 12 scenes, 240 trials)
| metric | pretrained | $N=100$ RL |
|---|---|---|
| overall success rate | 21.7% | 75.0% |
| partial success rate | 45.0% | 88.3% |
| dynamics failure rate | 66.7% | 18.3% |
| time to finish | 11.5s | 10.2s |
scene 10 (OOD object): 0% to 50%. scene 11 (OOD stacking composition): 20% to 50%, partial success 50% to 100%.
inference latency
| $K$ | torch.compile (s) | speedup vs $K=10$ |
|---|---|---|
| 10 (pretrain) | 0.172 | baseline |
| 4 | 0.107 | 1.61x |
| 2 | 0.086 | 2.00x |
| 1 | 0.073 | 2.36x |
reproduction guide
no code released. to reproduce:
- obtain $\pi_0$ checkpoint from https://github.com/allenzren/open-pi-zero
- set up ManiSkill 3 with 192 parallel envs
- generate 100 scenes using EmbodiedGen + GPT-4o scene designer (46.8 min/scene on single 4090, or ~2 min from pre-built library)
- add noise head (shallow MLP, log-std clamped to $[-2.5, -2.0]$) and value head (shallow MLP) to $\pi_0$
- train with PPOFlow: LoRA rank 32 on VLM, full fine-tune on action expert + heads, 5 days on 8x RTX 6000 Ada
- deploy with PD control + gravity compensation on Interbotix WidowX 250S
compute cost: ~960 GPU-hours on RTX 6000 Ada (8 GPUs × 5 days × 24 h; ~$580 on vast.ai at $0.60/hr).
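the deployment controller (PD + gravity compensation, no system identification) is simple enough to sketch in full; the gains and the toy gravity model are illustrative, the real $g(q)$ would come from the robot's URDF:

```python
import numpy as np

def pd_gravity_torque(q, qd, q_des, gravity_fn, kp=50.0, kd=2.0):
    """tau = Kp (q_des - q) - Kd qd + g(q): PD tracking plus gravity
    compensation, the only dynamics model used at deployment."""
    return kp * (q_des - q) - kd * qd + gravity_fn(q)

# toy gravity model for illustration only
g = lambda q: 0.5 * np.sin(q)

q   = np.zeros(6)                 # current joint positions (rad)
tau = pd_gravity_torque(q, np.zeros(6), np.full(6, 0.1), g)
print(tau)
```

at 5 Hz the VLA emits end-effector deltas; an inner loop like this tracks the resulting joint targets, which is why no full inertial model is needed for transfer.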
notes
the scene scaling results are the most important finding: $N=50$ gives 79.2% OOD vs 51.6% at $N=1$. this is a clean ablation that demonstrates scene distribution breadth, not RL algorithm complexity, is the key to generalization. the $K=1$ result (no performance loss, 2.36x speedup) is practically significant: single-step gaussian sampling is much cheaper than multi-step ODE integration.
the EmbodiedGen + GPT-4o pipeline is expensive (46 min/scene) but produces reusable assets. 85% acceptance rate means ~120 scenes need to be generated for 100 usable ones. the QA pipeline is a clever way to automate quality control but adds cost.
connection to existing notes: this reinforces synthetic-data-bridges-cross-embodiment-gap and inference-time-guidance-pattern-robotics. the 3D world generation approach is different from 3DGS reconstruction (AirVLA) but serves the same purpose: bridging the data gap. the sim-to-real transfer without system identification (just PD + gravity compensation) is notably simple.