2026-04-01

Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu

VLA sim-to-real RL 3D-generation

problem

real-world RL fine-tuning of VLA models achieves high success rates (>90%) but only in narrowly scoped settings. scaling scene/object diversity in the physical world is prohibitively expensive. this paradoxically transforms broadly pretrained VLAs into overfitted, scene-specific policies. existing sim-to-real RL approaches (SimpleVLA-RL, ReBot, VLA-RFT) train in only 3 hand-designed scenes, severely limiting generalization.

the gap: nobody has shown how to scale scene distribution for sim-to-real RL beyond a handful of manually designed environments. generative 3D models (TRELLIS, WorldGen, Holodeck) create static assets or procedural environments with limited physical interactivity. EmbodiedGen does language-driven interactive 3D scene generation but hasn’t been integrated with VLA RL training at scale.

architecture

base model: $\pi_0$

built on the $\pi_0$ architecture (~3B total params): Gemma 2B VLM backbone + SigLIP 400M vision encoder + 300M flow-matching action expert head. pretrained on BridgeV2 with rectified flow matching:

\[\mathcal{L}_{\text{flow}}(\theta) = \mathbb{E}\left[\|v_\theta(A^t_\tau, KV_\theta(o_t), \tau) - (A^t_1 - \varepsilon)\|^2\right]\]

ODE integration from $\tau=0$ to $\tau=1$ with $K=10$ steps. action chunk size $C=4$, actions are end-effector delta-pose + binary gripper $\in \mathbb{R}^{C \times 7}$.
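a minimal sketch of the $K$-step Euler integration over the flow ODE, assuming numpy and a toy velocity field standing in for the actual action expert:

```python
import numpy as np

C, K = 4, 10  # action chunk size and integration steps from the paper

def v_theta(a, obs, tau):
    # toy linear velocity field; the real one is the pi_0 action expert
    return obs - a

def integrate_chunk(obs, K=K, rng=np.random.default_rng(0)):
    """Euler-integrate the flow ODE from tau=0 (gaussian noise) to tau=1."""
    a = rng.standard_normal((C, 7))  # start from noise in R^{C x 7}
    dt = 1.0 / K
    for k in range(K):
        a = a + dt * v_theta(a, obs, k * dt)
    return a  # end-effector delta-pose + gripper action chunk

chunk = integrate_chunk(np.zeros((C, 7)))
```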

PPOFlow (RL fine-tuning)

adds two heads to $\pi_0$:

  • noise head $\sigma_\phi$: shallow MLP that injects learnable gaussian noise into the flow-matching integration, converting the deterministic policy into a stochastic one. log-std output constrained to $[-2.5, -2.0]$. fully fine-tuned.
  • value head $V_\psi$: shallow MLP for GAE advantage computation. fully fine-tuned.
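a sketch of the two heads in numpy; the feature dimension, MLP widths, and the sigmoid squash into the log-std range are my assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # tiny 2-layer MLP standing in for the paper's "shallow MLP" heads
    return np.tanh(x @ w1 + b1) @ w2 + b2

D = 16  # assumed feature dimension coming out of the pi_0 backbone
def make_params(out):
    return (rng.standard_normal((D, 32)) * 0.1, np.zeros(32),
            rng.standard_normal((32, out)) * 0.1, np.zeros(out))

noise_params, value_params = make_params(1), make_params(1)

def noise_head(feat):
    """log-std squashed into the paper's range [-2.5, -2.0]."""
    raw = mlp(feat, *noise_params)
    return -2.5 + 0.5 / (1.0 + np.exp(-raw))

def value_head(feat):
    """scalar state value used for GAE advantage computation."""
    return mlp(feat, *value_params)

feat = rng.standard_normal(D)
log_std = noise_head(feat).item()
```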

power-scaled importance ratio: $\hat{r}_t = (\pi_\theta / \pi_{\theta,\text{old}})^s$ with $s=0.2$, which is crucial for training stability. PPO clipped objective with $\epsilon=0.2$.
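the power-scaled clipped objective in code, as a sketch operating on log-probs (the scaling is just $\hat{r} = \exp(s \cdot (\log\pi - \log\pi_{\text{old}}))$):

```python
import numpy as np

def ppo_loss(logp, logp_old, adv, s=0.2, eps=0.2):
    """Clipped PPO surrogate with a power-scaled importance ratio."""
    r = np.exp(s * (logp - logp_old))  # (pi/pi_old)^s tempers large ratios
    unclipped = r * adv
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```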

key insight: reducing $K$ from 10 to 1 during RL converts the multi-step flow-matching policy into a single-step gaussian policy, with no performance degradation and a 2.36x inference speedup with torch.compile.
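how the $K=1$ collapse looks in code: one Euler step from noise plus the learned gaussian noise gives a closed-form log-prob (conditioned on the initial noise draw). this is my reading of the mechanism, not the paper's implementation:

```python
import numpy as np

def single_step_sample(obs, v_theta, log_std, rng):
    """K=1 flow step: one Euler step from gaussian noise, then learnable
    noise on top, making the policy an explicit diagonal gaussian."""
    eps = rng.standard_normal(obs.shape)   # tau=0 sample
    mean = eps + v_theta(eps, obs, 0.0)    # single Euler step, dt = 1
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    # diagonal gaussian log-prob of the sampled action chunk
    logp = float(-0.5 * np.sum(((action - mean) / std) ** 2
                               + 2 * log_std + np.log(2 * np.pi)))
    return action, logp

# toy velocity field standing in for the pi_0 action expert
a, lp = single_step_sample(np.zeros((4, 7)),
                           lambda x, o, t: o - x,
                           log_std=-2.2,
                           rng=np.random.default_rng(0))
```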

fine-tuning strategy

LoRA (rank 32) on the full VLM (including SigLIP) + full fine-tune of the action expert + value/noise heads. alternatives tested:

  • (a) freeze VLM, LoRA action head: model collapse
  • (b) freeze VLM, full action head: model collapse
  • (c) freeze SigLIP, LoRA Gemma + LoRA action head: a few pp lower
  • (d) LoRA VLM + LoRA action head: worse than full fine-tune on action head

generative 3D world pipeline

GPT-4o converts task descriptions to structured scene graphs with semantic roles (background, context, distractors, targets, robot) and spatial relations. fed into extended EmbodiedGen to produce interactive ManiSkill 3 environments. automated QA pipeline with three GPT-4o checkers: semantic appearance (83.3% pass), mesh geometry (75.2% pass), text-to-3D alignment (91.9% pass). 85% acceptance rate. 100 generated environments with 516 unique object assets (avg 5.16 interactive objects per scene).
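a hypothetical scene graph in the shape described; the field names and values are my guesses at the schema, not the paper's:

```python
# hypothetical scene-graph output of the GPT-4o scene designer:
# semantic roles (background, context, distractors, targets, robot)
# plus spatial relations between objects
scene_graph = {
    "task": "put the carrot on the plate",
    "background": "kitchen counter",
    "context": ["cutting board"],
    "distractors": ["mug", "sponge"],
    "targets": {"A": "carrot", "B": "plate"},
    "robot": "widowx_250s",
    "relations": [
        ("carrot", "on", "cutting board"),
        ("plate", "right_of", "cutting board"),
    ],
}
```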

```mermaid
flowchart LR
    Task[task description] --> GPT4o[GPT-4o scene designer]
    GPT4o --> SceneGraph[structured scene graph]
    SceneGraph --> EmbodiedGen[EmbodiedGen engine]
    EmbodiedGen --> QA[GPT-4o QA pipeline]
    QA --> ManiSkill[ManiSkill 3 simulator]
    ManiSkill --> VLA[VLA RL training]
```

training

hardware: 8x NVIDIA RTX 6000 Ada (training), 1x RTX 4090 (inference)

duration: 5 days

| parameter | value |
| --- | --- |
| environments | 192 parallel |
| batch size | 19,200 |
| mini-batch size | 1,920 |
| episode length | 25 steps |
| discount factor $\gamma$ | 0.99 |
| learning rate | $2 \times 10^{-5}$ |
| gradient clip (global norm) | 0.5 |
| clipping ratio $\epsilon$ | 0.2 |
| importance ratio scale $s$ | 0.2 |
| LoRA rank | 32 |
| integration steps $K$ | 1 (down from 10) |
| control frequency | 5 Hz |

reward: sparse rule-based success from simulator. success = contact(A,B) AND NOT contact(A,table) AND NOT contact(A,robot).
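the success predicate as a sketch, assuming the simulator exposes contacts as a set of (body, body) pairs with the grasped object A listed first:

```python
def success(contacts):
    """Sparse rule-based success: A touches B, and A touches
    neither the table nor the robot (i.e. it was released on B)."""
    return (("A", "B") in contacts
            and ("A", "table") not in contacts
            and ("A", "robot") not in contacts)
```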

domain randomization: object position $\pm 0.15$m, orientation $[0, 2\pi]$, robot joint perturbation $\pm 0.1$ rad, camera position $\pm 0.05$m, ambient light RGB $[0, 0.6]^3$.
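one randomization draw with the paper's ranges; which quantities are per-axis (xy vs xyz) and the 6-DOF joint count are my assumptions:

```python
import numpy as np

def randomize(rng=np.random.default_rng(0)):
    """Sample one domain-randomization draw with the paper's ranges."""
    return {
        "obj_pos_offset": rng.uniform(-0.15, 0.15, size=2),  # meters, xy (assumed)
        "obj_yaw": rng.uniform(0.0, 2 * np.pi),              # radians
        "joint_perturb": rng.uniform(-0.1, 0.1, size=6),     # radians, 6 DOF (assumed)
        "cam_pos_offset": rng.uniform(-0.05, 0.05, size=3),  # meters
        "ambient_rgb": rng.uniform(0.0, 0.6, size=3),        # light color in [0, 0.6]^3
    }

draw = randomize()
```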

evaluation

simulation (Table 1)

| scenes $N$ | EG all SR (%) | SE SR (%) |
| --- | --- | --- |
| 0 (pretrained) | 9.7 | 23.7 |
| 3 (manually designed) | 36.0 | 96.7 |
| 1 | 51.6 | 36.1 |
| 10 | 72.1 | 54.3 |
| 25 | 78.3 | 70.1 |
| 50 | 79.2 | 68.4 |
| 100 | 79.8 | 74.3 |

(EG all = eval over the generated scenes; SE = the manually designed scenes.)

$N=3$ manual scenes: 96.7% on the training scenes but only 36.0% on generated scenes, i.e. severe overfitting (60.7pp gap). $N=100$ achieves 79.8% on generated AND 74.3% on the manual scenes (never trained on them). scaling scene diversity from $N=1$ to $N=50$ gives a +27.6pp OOD improvement (51.6% → 79.2%).

sim-to-real (Table 2, 12 scenes, 240 trials)

| metric | pretrained | $N=100$ RL |
| --- | --- | --- |
| overall success rate | 21.7% | 75.0% |
| partial success rate | 45.0% | 88.3% |
| dynamics failure rate | 66.7% | 18.3% |
| time to finish | 11.5 s | 10.2 s |

scene 10 (OOD object): 0% to 50%. scene 11 (OOD stacking composition): 20% to 50%, partial success 50% to 100%.

inference latency

| $K$ | latency w/ torch.compile (s) | speedup vs $K=10$ |
| --- | --- | --- |
| 10 (pretrain) | 0.172 | baseline |
| 4 | 0.107 | 1.61x |
| 2 | 0.086 | 2.00x |
| 1 | 0.073 | 2.36x |

reproduction guide

no code released. to reproduce:

  1. obtain $\pi_0$ checkpoint from https://github.com/allenzren/open-pi-zero
  2. set up ManiSkill 3 with 192 parallel envs
  3. generate 100 scenes using EmbodiedGen + GPT-4o scene designer (46.8 min/scene on single 4090, or ~2 min from pre-built library)
  4. add noise head (shallow MLP, output range $\log \in [-2.5, -2.0]$) and value head (shallow MLP) to $\pi_0$
  5. train with PPOFlow: LoRA rank 32 on VLM, full fine-tune on action expert + heads, 5 days on 8x RTX 6000 Ada
  6. deploy with PD control + gravity compensation on Interbotix WidowX 250S
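the deployment controller (step 6) is plain PD tracking plus a gravity-compensation feedforward; a sketch with made-up gains, since the paper doesn't report them:

```python
import numpy as np

KP, KD = 30.0, 1.0  # gains are assumptions, not from the paper

def joint_torques(q, qd, q_des, gravity_comp):
    """PD tracking of the desired joint positions plus a
    gravity-compensation feedforward term."""
    return KP * (q_des - q) - KD * qd + gravity_comp

# at the target with zero velocity, only gravity compensation remains
tau = joint_torques(np.zeros(6), np.zeros(6), np.zeros(6), np.ones(6))
```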

compute cost: ~960 GPU-hours on RTX 6000 Ada (8 GPUs x 5 days x 24 h; ~$580 on vast.ai at $0.60/GPU-hr).

notes

the scene scaling results are the most important finding: $N=50$ gives 79.2% OOD SR vs 51.6% at $N=1$. this is a clean ablation demonstrating that scene-distribution breadth, not RL algorithm complexity, is the key to generalization. the $K=1$ result (no performance loss, 2.36x speedup) is practically significant: single-step gaussian sampling is much cheaper than multi-step ODE integration.

the EmbodiedGen + GPT-4o pipeline is expensive (46.8 min/scene) but produces reusable assets. an 85% acceptance rate means ~118 scenes must be generated to get 100 usable ones. the QA pipeline is a clever way to automate quality control but adds cost.

connection to existing notes: this reinforces synthetic-data-bridges-cross-embodiment-gap and inference-time-guidance-pattern-robotics. the 3D world generation approach is different from 3DGS reconstruction (AirVLA) but serves the same purpose: bridging the data gap. the sim-to-real transfer without system identification (just PD + gravity compensation) is notably simple.