2026-04-01

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu et al.

world-models video-generation embodied-AI

problem

egocentric world simulators for embodied AI need to generate spatially consistent interaction videos and maintain a persistent world state across multi-step interactions. prior work has two fundamental limitations: (1) lack of explicit 3D grounding causes structural drift under viewpoint changes (Hand2World, InterDyn, CosHand), and (2) static scene reconstruction that is never updated after interactions, making multi-stage manipulation impossible (DWM reconstructs the scene once and never updates it).

DWM (Decoupling World Model) explicitly decouples static 3D scenes from action-induced dynamics by conditioning on rendered point maps and hand meshes, but the scene is reconstructed once and never updated – it cannot handle sequential manipulation where objects move between clips. Hand2World distills bidirectional video diffusion into a causal autoregressive generator but has no explicit 3D representation, which limits spatial consistency under large viewpoint changes; it is also restricted to tabletop scenes and trained on fewer than 8K clips. InterDyn extends SVD with a ControlNet branch conditioned on binary hand masks but has no 3D grounding and no persistent state.

the core question: can you build a closed-loop simulator that generates spatially consistent egocentric videos and updates the underlying 3D scene state after each interaction, enabling continuous multi-stage simulation?

architecture

flowchart LR
    S_prev["state S_{k-1}"] --> Pi[render Pi]
    H_k["action H_k"] --> O_gen[observation sim]
    Pi --> O_gen
    O_gen --> O_k["generated O_k"]
    O_k --> U[state update U]
    S_prev --> U
    U --> S_k["state S_k"]
    S_k --> Pi
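the loop above can be sketched in code; all function bodies here are placeholder stubs (the names are illustrative, not the paper's API), showing only the control flow that makes the simulator closed-loop:

```python
# minimal sketch of the closed simulation loop; every function body is a
# placeholder stub, not EgoSim's actual implementation

def render_observation(state, camera_traj):
    """render the persistent point cloud along the camera trajectory (Pi)."""
    return {"bg_video": state["points"], "camera_traj": camera_traj}

def simulate_observation(rendered, action):
    """video diffusion model: generate clip O_k conditioned on the rendering
    and the action H_k (hand keypoint trajectory)."""
    return {"action": action, "conditioning": rendered}

def update_state(state, observation):
    """training-free reconstruction: depth, poses, TSDF fusion into S_k."""
    return {"points": state["points"], "step": state["step"] + 1}

def run_simulation(initial_state, actions, camera_traj):
    state, clips = initial_state, []
    for action in actions:                 # one iteration per interaction clip
        rendered = render_observation(state, camera_traj)
        obs = simulate_observation(rendered, action)
        clips.append(obs)
        state = update_state(state, obs)   # closed loop: S_{k-1} -> S_k
    return state, clips
```

the point of the structure is that `update_state` feeds back into `render_observation` – drop that edge and you have a one-shot video generator rather than a simulator.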

EgoSim has two core modules:

1. Geometry-action-aware Observation Simulation – a 14B parameter video diffusion model based on Wan-2.1-Fun-14B-InP (Diffusion Transformer with Flow Matching). the DiT is fine-tuned while T5 text encoder, video VAE, and CLIP image encoder are frozen.

input concatenation in latent space (52 channels total):

\[z_{\text{in}}^{(t)} = \text{Concat}(z_t, z_{\text{bg}}, z_{\text{hand}}, M)\]

where $z_t$ is the noisy latent, $z_{\text{bg}}$ is the encoded rendered scene observation video (point cloud rendered along camera trajectory), $z_{\text{hand}}$ is the encoded 3D hand keypoint video (21-keypoint MANO skeleton projected to 2D using perspective projection for depth-dependent foreshortening), and $M$ is a binary mask indicating unobserved regions requiring synthesis.

the model initializes from pretrained inpainting weights, so it behaves close to an identity mapping on observed regions while retaining a generative prior for the rest – it preserves the known background and generates only in action-conditioned and unobserved regions. this is critical: without the mask, the model would hallucinate over the entire frame.
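a numpy sketch of the channel-wise concatenation. the 16-channel VAE latent and 4-channel mask split is an assumption (it is consistent with the stated 52-channel total: 3 × 16 + 4), as is the latent grid size, taken as 61 × 480 × 832 under 4x temporal / 8x spatial VAE compression:

```python
import numpy as np

# illustrative latent shapes (channels, latent_frames, h, w); the 16-channel
# VAE latent and 4-channel mask are assumptions that sum to the stated 52
F, H, W = 16, 60, 104                    # ~61 frames / 4, 480 / 8, 832 / 8
z_t    = np.random.randn(16, F, H, W)    # noisy latent
z_bg   = np.random.randn(16, F, H, W)    # encoded rendered-scene video
z_hand = np.random.randn(16, F, H, W)    # encoded 3D hand-keypoint video
M      = np.zeros((4, F, H, W))          # binary mask of unobserved regions

z_in = np.concatenate([z_t, z_bg, z_hand, M], axis=0)  # 52 channels total
```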

loss is standard denoising score matching:

\[\mathcal{L}_{\text{gen}} = \mathbb{E}_{z_0, t, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \left\| \epsilon - \epsilon_\theta \left( z_{\text{in}}^{(t)}, t \right) \right\|_2^2 \right]\]

output: 61 frames at 16 FPS (~3.8 seconds per clip), 832 $\times$ 480 pixels. flow matching with 50 steps, CFG scale 1.0.
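a minimal Euler sampler for flow matching with 50 fixed steps, as stated (CFG scale 1.0 means guidance is effectively disabled). the velocity field below is a toy stand-in for the conditioned DiT, chosen so that the straight-path ODE has a known endpoint:

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=50):
    """integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0.copy(), 1.0 / num_steps
    for n in range(num_steps):
        t = n * dt
        x = x + dt * velocity_fn(x, t)
    return x

# toy velocity field transporting any x0 to a known target along a straight
# path by t=1; in EgoSim this would be the DiT's predicted velocity
target = np.full((4, 4), 3.0)
v = lambda x, t: (target - x) / (1.0 - t)

x0 = np.random.randn(4, 4)
x1 = euler_sample(v, x0, num_steps=50)   # converges to target at t=1
```

for this particular linear field the Euler error telescopes to zero, so `x1` matches `target` up to float precision; the real model's velocity field is of course not exactly integrable this way.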

action representation: 21-keypoint MANO skeleton for humans, mapped to simplified thumb + index finger skeleton with gripper opening state for robots. this universal keypoint representation enables cross-embodiment transfer (human hand demos to robot policies).
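a sketch of the human-to-robot keypoint mapping, assuming the standard MANO/OpenPose 21-keypoint ordering (0 = wrist, 4 = thumb tip, 8 = index tip); the gripper-opening heuristic (thumb tip to index tip distance) is illustrative, not necessarily the paper's exact mapping:

```python
import numpy as np

# standard MANO 21-keypoint indices (an assumption of this sketch)
WRIST, THUMB = 0, [1, 2, 3, 4]   # thumb chain, tip at index 4
INDEX = [5, 6, 7, 8]             # index-finger chain, tip at index 8

def hand_to_gripper(keypoints_3d):
    """map a (21, 3) MANO skeleton to a simplified thumb+index sub-skeleton
    plus a scalar gripper opening (thumb tip to index tip distance)."""
    kp = np.asarray(keypoints_3d, dtype=float)
    simplified = kp[[WRIST] + THUMB + INDEX]   # (9, 3) sub-skeleton
    opening = np.linalg.norm(kp[THUMB[-1]] - kp[INDEX[-1]])
    return simplified, opening
```

projecting both embodiments into this shared keypoint space is what lets the same conditioning channel drive human-hand and robot-gripper clips.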

2. Interaction-aware State Updating – training-free module that reconstructs and updates 3D scene state from generated observations:

  • state reconstruction: SAM3 extracts instance masks guided by Grounding-DINO; DepthAnything3 estimates per-frame depths and camera poses via dual-pass DROID-SLAM with multi-view depth alignment
  • object state update: VLM (Qwen-2.5) identifies interaction objects, SAM3 tracks them in 3D, five-stage hierarchical filtering (semantic tagging, spatial proximity via IoU > 0.15, depth refinement with median diff < 0.15m, temporal completeness, serialization)
  • incremental fusion: TSDF fusion with Sim3 Umeyama alignment, voxel size 0.003m, max depth 3.0m, statistical outlier removal
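the Sim3 Umeyama alignment step in the fusion stage has a closed-form solution; this is the standard Umeyama (1991) similarity-transform estimator with scale, written from scratch as a sketch rather than taken from EgoSim's (unreleased) code:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """closed-form similarity transform (scale s, rotation R, translation t)
    minimizing ||dst - (s * R @ src + t)||^2 over corresponding 3D points.
    src, dst: (N, 3) arrays of matched points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # guard against reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src          # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

in the fusion pipeline this would align a newly reconstructed (scale-ambiguous monocular) point cloud to the persistent world state before TSDF integration.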

training

8 $\times$ NVIDIA H200 GPUs. NOT reproducible on consumer hardware.

main training: AdamW, lr = $10^{-5}$, per-GPU batch size 4 (effective batch 32), 4000 steps. 400K clips total: 240K from EgoDex (fine-grained tabletop interactions) + 160K from EgoVid (in-the-wild egocentric).

cross-embodiment finetuning: AgiBot-World-Beta (30 tasks, 100K clips), 50K train / 150 test, 200 steps. with pretraining, PSNR reaches 18.670 vs 15.180 without – the +3.490 gain confirms the universal keypoint representation transfers across embodiments.

real-world finetuning: EgoCap (50 clips in supermarket, shelf interactions), 30 train / 20 test, 50 steps, per-GPU batch 1.

inference: single H200 or A100 80GB sufficient. 14B model requires ~28GB+ GPU memory.

evaluation

EgoDex tabletop (single clip):

| model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Depth ERR ↓ | Cam ERR ↓ |
|---|---|---|---|---|---|
| Wan-2.1-14B-InP | 17.998 | 0.447 | 0.708 | 42.335 | 0.0300 |
| Mask2IV | 20.622 | 0.814 | 0.299 | 38.339 | 0.0181 |
| InterDyn | 22.250 | 0.830 | 0.255 | 44.345 | 0.0226 |
| EgoSim | 25.056 | 0.896 | 0.170 | 8.888 | 0.0013 |

EgoSim dominates on spatial consistency: Depth ERR reduced over 4x vs best baseline (8.888 vs 38.339), Cam ERR reduced 17x (0.0013 vs 0.0226).
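for reference, PSNR as conventionally computed for comparisons like these (assuming 8-bit peak value 255; the paper's exact evaluation code is not released):

```python
import numpy as np

def psnr(pred, gt, peak=255.0):
    """peak signal-to-noise ratio in dB between two frames/videos."""
    mse = np.mean((np.asarray(pred, float) - np.asarray(gt, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

gt = np.zeros((480, 832))
pred = np.full_like(gt, 25.5)   # constant error of 10% of peak
print(psnr(pred, gt))            # -> 20.0 dB
```

the log scale is worth keeping in mind when reading the tables: the ~3 dB gap between EgoSim and InterDyn corresponds to roughly halving the mean squared error.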

EgoVid in-the-wild (single clip):

| model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Depth ERR ↓ | Cam ERR ↓ |
|---|---|---|---|---|---|
| InterDyn | 14.612 | 0.466 | 0.484 | 38.180 | 0.0308 |
| EgoSim | 16.684 | 0.509 | 0.421 | 19.260 | 0.0105 |

continuous generation (121 frames = 2 consecutive clips on EgoDex):

| setting | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Depth ERR ↓ | Cam ERR ↓ |
|---|---|---|---|---|---|
| single clip | 25.056 | 0.896 | 0.170 | 8.888 | 0.0013 |
| continuous | 19.165 | 0.835 | 0.220 | 10.943 | 0.0017 |

the degradation comes from cumulative error: artifacts in generated clips feed noise into the updated 3D state, which then conditions the next clip. spatial consistency (Cam ERR) barely degrades (+0.0004), which validates the state-updating pipeline.

ablation on EgoDex:

| variant | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o trajectory | 23.380 | 0.845 | 0.244 |
| w/o mask | 23.988 | 0.886 | 0.186 |
| full EgoSim | 25.056 | 0.896 | 0.170 |

both components contribute: trajectory rendering gives +1.676 PSNR, mask inpainting gives +1.068 PSNR.

no formal statistical significance tests reported. wins on ALL metrics in ALL settings.

reproduction guide

code not yet released (“Codes and datasets will be open soon”). project page: https://egosimulator.github.io

dependencies: Wan-2.1-Fun-14B-InP checkpoint (~28GB+), DepthAnything3, SAM3, HaMeR, DROID-SLAM, GeoCalib, Grounding-DINO, Open3D (TSDF), ARTDECO (for EgoCap), Qwen-Image-Edit / Qwen3VL-8B.

full training requires 8 $\times$ H200 GPUs – hundreds of GPU-hours. cross-embodiment finetuning (200 steps on 50K clips) is feasible on a single high-end GPU with gradient accumulation. few-shot real-world (50 steps on 30 clips) is feasible on a single consumer GPU.

known gotchas: monocular depth/camera estimation fails in heavily occluded or highly dynamic environments. HaMeR is unstable on in-the-wild egocentric videos (motion blur, occlusion, low-res) – the filtering pipeline is essential. continuous generation accumulates error.

notes

the key insight is that world state persistence through 3D point cloud updating is what enables multi-stage simulation. previous world models generate one clip and forget – EgoSim generates, reconstructs the 3D state, updates it, then generates the next clip. this closed-loop design is what separates a “video generator” from a “world simulator.”

the universal keypoint representation (MANO skeleton for humans, simplified thumb+index for robots) enabling cross-embodiment transfer with just 200 finetuning steps is notable. this suggests that the action representation space is more transferable than the visual appearance space.

the TSDF-based state updating is training-free – it uses off-the-shelf SLAM and depth estimation. this is both a strength (no additional training) and a weakness (cascading errors from monocular depth estimation accumulate). the Cam ERR barely increasing in continuous generation (+0.0004) suggests the 3D consistency holds well, but the PSNR drop of 5.891 points indicates visual quality degrades.

14B parameters is expensive. the question is whether the 3D state updating approach can work with smaller video diffusion backbones. if a 2B model can achieve comparable spatial consistency with the same state updating pipeline, this becomes much more practical for robotics.

the scalable data pipeline (500K clips from web-scale monocular egocentric videos) is important – it avoids the bottleneck of paired egocentric-exocentric captures that prior work (PlayerOne) requires.