2026-04-01
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu
problem
real-world RL fine-tuning of VLA models achieves high success rates (>90%) but only in narrowly scoped settings. scaling scene/object diversity in the physical world is prohibitively expensive. this paradoxically transforms broadly pretrained VLAs into overfitted, scene-specific policies. existing sim-to-real RL approaches (SimpleVLA-RL, ReBot, VLA-RFT) train in only 3 hand-designed scenes, severely limiting generalization.
the gap: nobody has shown how to scale scene distribution for sim-to-real RL beyond a handful of manually designed environments. generative 3D models (TRELLIS, WorldGen, Holodeck) create static assets or procedural environments with limited physical interactivity. EmbodiedGen does language-driven interactive 3D scene generation but hasn’t been integrated with VLA RL training at scale.
architecture
base model: $\pi_0$
built on the $\pi_0$ architecture (~3B total params): Gemma 2B VLM backbone + SigLip 400M vision encoder + 300M flow-matching action expert head. pretrained on BridgeV2 with rectified flow-matching:
\[\mathcal{L}_{\text{flow}}(\theta) = \mathbb{E}\left[\|v_\theta(A^t_\tau, KV_\theta(o_t), \tau) - (A^t_1 - \varepsilon)\|^2\right]\]
ODE integration from $\tau=0$ to $\tau=1$ with $K=10$ steps. action chunk size $C=4$, actions are end-effector delta-pose + binary gripper $\in \mathbb{R}^{C \times 7}$.
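a minimal sketch of the rectified flow-matching loss and Euler ODE sampling, with a toy MLP standing in for the 300M action expert (all sizes except $C=4$, $D=7$ are illustrative):

```python
import torch

torch.manual_seed(0)
C, D = 4, 7                       # chunk size, action dim (delta-pose + gripper)

# toy velocity network standing in for the conditioned action expert
v_theta = torch.nn.Sequential(torch.nn.Linear(C * D + 1, 64),
                              torch.nn.ReLU(),
                              torch.nn.Linear(64, C * D))

def flow_loss(a1, eps, tau):
    """Rectified-flow loss: regress the straight-line velocity (a1 - eps)."""
    t = tau.view(-1, 1, 1)
    a_tau = (1 - t) * eps + t * a1                # point on the noise->data line
    inp = torch.cat([a_tau.flatten(1), tau], dim=1)
    return ((v_theta(inp) - (a1 - eps).flatten(1)) ** 2).mean()

def sample_actions(K=10):
    """Euler ODE integration from tau=0 (noise) to tau=1 in K steps."""
    a = torch.randn(1, C, D)
    for k in range(K):
        tau = torch.full((1, 1), k / K)
        a = a + v_theta(torch.cat([a.flatten(1), tau], 1)).view(1, C, D) / K
    return a

a1, eps = torch.randn(2, C, D), torch.randn(2, C, D)
loss = flow_loss(a1, eps, torch.rand(2, 1))
act = sample_actions(K=1)         # K=1 single-step variant used during RL
print(loss.shape, act.shape)
```

with $K=1$ the loop collapses to a single Euler step, which is what makes the gaussian-policy conversion below cheap.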
PPOFlow (RL fine-tuning)
adds two heads to $\pi_0$:
- noise head $\sigma_\phi$: shallow MLP that injects learnable gaussian noise into the flow-matching integration, converting the deterministic policy into a stochastic one. log-std clamped to $[-2.5, -2.0]$. fully fine-tuned.
- value head $V_\psi$: shallow MLP for GAE advantage computation. fully fine-tuned.
power-scaled importance ratio: $\hat{r}_t = (\pi_\theta / \pi_{\theta,\text{old}})^s$ with $s=0.2$, crucial for training stability. PPO clipped objective with $\epsilon=0.2$.
key insight: reducing $K$ from 10 to 1 during RL converts multi-step flow matching into single-step gaussian policy. no performance degradation, 2.36x inference speedup with torch.compile.
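a sketch of the two PPOFlow ingredients above: a gaussian policy built from the noise head's clamped log-std, and the power-scaled clipped surrogate (the `exp(s * Δlogp)` form is my numerically stable reading of $\hat{r}_t = r_t^s$; the toy mean/log-std values are illustrative):

```python
import torch

def ppoflow_surrogate(logp_new, logp_old, adv, s=0.2, eps_clip=0.2):
    """PPO clipped objective with a power-scaled importance ratio:
    r_hat = (pi_new / pi_old) ** s = exp(s * (logp_new - logp_old))."""
    r_hat = torch.exp(s * (logp_new - logp_old))
    unclipped = r_hat * adv
    clipped = torch.clamp(r_hat, 1 - eps_clip, 1 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()

# with K=1 the policy is a gaussian around the single Euler step:
# a = mu + sigma * z, mu from the flow step, sigma from the noise head
log_std = torch.clamp(torch.zeros(7), -2.5, -2.0)   # noise head output (toy)
mu = torch.zeros(7)                                  # single Euler step (toy)
dist = torch.distributions.Normal(mu, log_std.exp())
a = dist.sample()
logp = dist.log_prob(a).sum()

loss = ppoflow_surrogate(logp.unsqueeze(0), logp.unsqueeze(0) - 0.1,
                         adv=torch.tensor([1.0]))
print(float(loss))
```

note that at $s=0.2$ a 2x likelihood ratio shrinks to $2^{0.2} \approx 1.15$, which is why the power scaling keeps updates inside the clip region far longer than the raw ratio would.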
fine-tuning strategy
LoRA (rank 32) on the full VLM (including SigLip) + fully fine-tune action expert + value/noise heads. alternatives tested:
- (a) freeze VLM, LoRA action head: model collapse
- (b) freeze VLM, full action head: model collapse
- (c) freeze SigLip, LoRA Gemma + LoRA action head: a few pp lower
- (d) LoRA VLM + LoRA action head: worse than full fine-tune on action head
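the winning recipe (LoRA on the frozen VLM, full fine-tune elsewhere) can be sketched with a from-scratch LoRA wrapper; in practice one would use a library like peft, and the layer sizes here are toy:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen linear layer with a rank-r trainable update W + (alpha/r) * B @ A."""
    def __init__(self, base: torch.nn.Linear, r=32, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(128, 64), r=32)
y = layer(torch.randn(2, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # LoRA adds r*(in+out) = 32*(128+64) = 6144 params
```

zero-initializing `B` means the adapted layer starts exactly at the pretrained function, which matters when the RL signal is sparse.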
generative 3D world pipeline
GPT-4o converts task descriptions to structured scene graphs with semantic roles (background, context, distractors, targets, robot) and spatial relations. fed into extended EmbodiedGen to produce interactive ManiSkill 3 environments. automated QA pipeline with three GPT-4o checkers: semantic appearance (83.3% pass), mesh geometry (75.2% pass), text-to-3D alignment (91.9% pass). 85% acceptance rate. 100 generated environments with 516 unique object assets (avg 5.16 interactive objects per scene).
```mermaid
flowchart LR
    Task[task description] --> GPT4o[GPT-4o scene designer]
    GPT4o --> SceneGraph[structured scene graph]
    SceneGraph --> EmbodiedGen[EmbodiedGen engine]
    EmbodiedGen --> QA[GPT-4o QA pipeline]
    QA --> ManiSkill[ManiSkill 3 simulator]
    ManiSkill --> VLA[VLA RL training]
```
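a hypothetical scene graph in the spirit of the pipeline (field names and contents are my guesses at the structure, not the paper's actual schema), plus a minimal QA-style role check:

```python
import json

# illustrative scene graph: semantic roles + spatial relations,
# as consumed by the EmbodiedGen-based generator
scene_graph = {
    "background": "wooden kitchen table",
    "context": ["cutting board", "fruit bowl"],
    "distractors": ["mug", "spoon"],
    "targets": [{"object": "red apple", "relation": "on", "anchor": "plate"}],
    "robot": {"model": "WidowX 250S", "pose": "table_edge"},
}

def validate(graph):
    """Minimal QA-style check: all required semantic roles are present."""
    required = {"background", "context", "distractors", "targets", "robot"}
    return required <= graph.keys()

print(validate(scene_graph), len(json.dumps(scene_graph)) > 0)
```

the paper's actual QA is heavier (three GPT-4o checkers over appearance, geometry, and text-to-3D alignment); this only illustrates the structured hand-off from designer to generator.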
training
hardware: 8x NVIDIA RTX 6000 Ada (training), 1x RTX 4090 (inference)
duration: 5 days
| parameter | value |
|---|---|
| environments | 192 parallel |
| batch size | 19,200 |
| mini-batch size | 1,920 |
| episode length | 25 steps |
| discount factor $\gamma$ | 0.99 |
| learning rate | $2 \times 10^{-5}$ |
| gradient clip (global norm) | 0.5 |
| clipping ratio $\epsilon$ | 0.2 |
| importance ratio scale $s$ | 0.2 |
| LoRA rank | 32 |
| integration steps $K$ | 1 (down from 10) |
| control frequency | 5 Hz |
reward: sparse rule-based success from simulator. success = contact(A,B) AND NOT contact(A,table) AND NOT contact(A,robot), i.e. object A rests on target B without touching the table or the robot.
domain randomization: object position $\pm 0.15$m, orientation $[0, 2\pi]$, robot joint perturbation $\pm 0.1$ rad, camera position $\pm 0.05$m, ambient light RGB $[0, 0.6]^3$.
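the reward predicate and the per-episode randomization ranges above, sketched directly (contact pairs and the 6-joint arm are illustrative simplifications):

```python
import random

def success(contacts, obj="A", target="B"):
    """Sparse reward: A touches B, and A touches neither table nor robot."""
    return ((obj, target) in contacts
            and (obj, "table") not in contacts
            and (obj, "robot") not in contacts)

def randomize_episode(rng):
    """Sample per-episode domain randomization in the stated ranges."""
    return {
        "obj_xy":      [rng.uniform(-0.15, 0.15) for _ in range(2)],   # m
        "obj_yaw":      rng.uniform(0.0, 6.283),                       # rad
        "joint_noise": [rng.uniform(-0.1, 0.1) for _ in range(6)],     # rad
        "cam_offset":  [rng.uniform(-0.05, 0.05) for _ in range(3)],   # m
        "ambient_rgb": [rng.uniform(0.0, 0.6) for _ in range(3)],
    }

rng = random.Random(0)
print(success({("A", "B")}), success({("A", "B"), ("A", "table")}))
print(randomize_episode(rng)["obj_yaw"])
```

the NOT-contact terms are what turn "object near target" into a genuine place: a stack that still leans on the table or is pinned by the gripper scores zero.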
evaluation
simulation (Table 1)
| scenes $N$ | generated scenes (EG) SR (%) | manual scenes (SE) SR (%) |
|---|---|---|
| $N=0$ (pretrained) | 9.7 | 23.7 |
| $N=3$ (manually designed) | 36.0 | 96.7 |
| $N=1$ | 51.6 | 36.1 |
| $N=10$ | 72.1 | 54.3 |
| $N=25$ | 78.3 | 70.1 |
| $N=50$ | 79.2 | 68.4 |
| $N=100$ | 79.8 | 74.3 |
$N=3$ manual scenes: 96.7% on the training scenes but only 36.0% on generated scenes, severe overfitting (60.7pp gap). $N=100$ achieves 79.8% on generated AND 74.3% on the manual scenes it was never trained on. scaling scene diversity from 1 to 50 gives +27.6pp OOD improvement (51.6% → 79.2%).
sim-to-real (Table 2, 12 scenes, 240 trials)
| metric | pretrained | $N=100$ RL |
|---|---|---|
| overall success rate | 21.7% | 75.0% |
| partial success rate | 45.0% | 88.3% |
| dynamics failure rate | 66.7% | 18.3% |
| time to finish | 11.5s | 10.2s |
scene 10 (OOD object): 0% to 50%. scene 11 (OOD stacking composition): 20% to 50%, partial success 50% to 100%.
inference latency
| $K$ | torch.compile (s) | speedup vs $K=10$ |
|---|---|---|
| 10 (pretrain) | 0.172 | baseline |
| 4 | 0.107 | 1.61x |
| 2 | 0.086 | 2.00x |
| 1 | 0.073 | 2.36x |
reproduction guide
no code released. to reproduce:
- obtain $\pi_0$ checkpoint from https://github.com/allenzren/open-pi-zero
- set up ManiSkill 3 with 192 parallel envs
- generate 100 scenes using EmbodiedGen + GPT-4o scene designer (46.8 min/scene on single 4090, or ~2 min from pre-built library)
- add noise head (shallow MLP, log-std clamped to $[-2.5, -2.0]$) and value head (shallow MLP) to $\pi_0$
- train with PPOFlow: LoRA rank 32 on VLM, full fine-tune on action expert + heads, 5 days on 8x RTX 6000 Ada
- deploy with PD control + gravity compensation on Interbotix WidowX 250S
compute cost: ~960 GPU-hours on RTX 6000 Ada (8 GPUs × 5 days × 24 h; ~$580 on vast.ai at $0.60/hr).
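the deployment controller (PD + gravity compensation, no system identification) is simple enough to sketch in full; the gains and the toy gravity model are illustrative, the real $g(q)$ would come from the robot's URDF:

```python
import numpy as np

def pd_gravity_torque(q, qd, q_des, gravity_fn, kp=50.0, kd=2.0):
    """tau = Kp (q_des - q) - Kd qd + g(q): PD tracking plus gravity
    compensation, the only dynamics model used at deployment."""
    return kp * (q_des - q) - kd * qd + gravity_fn(q)

# toy gravity model for illustration only
g = lambda q: 0.5 * np.sin(q)

q   = np.zeros(6)                 # current joint positions (rad)
tau = pd_gravity_torque(q, np.zeros(6), np.full(6, 0.1), g)
print(tau)
```

at 5 Hz the VLA emits end-effector deltas; an inner loop like this tracks the resulting joint targets, which is why no full inertial model is needed for transfer.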
notes
the scene scaling results are the most important finding: $N=50$ gives 79.2% OOD vs 51.6% at $N=1$. this is a clean ablation that demonstrates scene distribution breadth, not RL algorithm complexity, is the key to generalization. the $K=1$ result (no performance loss, 2.36x speedup) is practically significant: single-step gaussian sampling is much cheaper than multi-step ODE integration.
the EmbodiedGen + GPT-4o pipeline is expensive (46 min/scene) but produces reusable assets. 85% acceptance rate means ~120 scenes need to be generated for 100 usable ones. the QA pipeline is a clever way to automate quality control but adds cost.
connection to existing notes: this reinforces synthetic-data-bridges-cross-embodiment-gap and inference-time-guidance-pattern-robotics. the 3D world generation approach is different from 3DGS reconstruction (AirVLA) but serves the same purpose: bridging the data gap. the sim-to-real transfer without system identification (just PD + gravity compensation) is notably simple.