2026-04-01
R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao, Yao Mu et al.
problem
large-scale 3D foundation models (e.g. VGGT) provide strong spatial priors for manipulation but are too slow for real-time control (~73ms per frame). naive multi-view fusion via feature concatenation ignores camera geometry. depth sensors are noisy on transparent/reflective surfaces, and estimated depth is no substitute (DP3 fed depth from Depth Anything V2 drops from 57.6% to 28.2% average success). DP3 also requires depth sensors AND object bounding boxes, adding deployment complexity.
the core tension: you need 3D understanding for contact-rich tasks (stacking, inserting, handover) but can’t afford to run a 3D foundation model at every timestep.
architecture
R3DP is a plug-and-play 3D-aware perception module that replaces the visual encoder in diffusion policy. three components:
asynchronous fast-slow collaboration (AFSC)
- slow system ($\Phi_{\text{slow}}$): frozen VGGT processes multi-view RGB every $\tau$ frames. produces Slow 3D-aware Features (S3DF).
- fast system ($\Phi_{\text{fast}}$): lightweight TFPNet propagates S3DF to every intermediate frame. produces Real-time 3D-aware Features (R3DF).
- only 1 in $\tau$ frames goes through VGGT. $\tau$ is configurable at deployment without retraining.
```mermaid
flowchart LR
    Frames[observation frames] --> Selector[frame selector]
    Selector -->|every tau frames| VGGT[VGGT, frozen]
    Selector -->|every frame| TFPNet[TFPNet, lightweight]
    VGGT --> S3DF[S3DF features]
    S3DF --> TFPNet
    TFPNet --> R3DF[R3DF features]
    R3DF --> MVFF[multi-view feature fuser]
    MVFF --> DiffHead[diffusion policy head]
```
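the fast-slow scheduling can be sketched in a few lines. this is a toy illustration (class and argument names are mine, not from the R3DP codebase): the slow model refreshes a feature cache on keyframes, the fast model runs every frame against that cache.

```python
class AFSCScheduler:
    """Toy sketch of asynchronous fast-slow collaboration (AFSC).

    Names and signatures are illustrative, not from the R3DP repo.
    Every tau-th frame refreshes the cached S3DF via the slow model
    (frozen VGGT, ~73 ms); every frame runs the fast model (TFPNet),
    which propagates the cached S3DF to the current timestep.
    """

    def __init__(self, tau=4):
        self.tau = tau          # configurable at deployment, no retraining
        self.s3df_cache = None  # last slow-system (S3DF) features

    def step(self, t, rgb, slow_model, fast_model):
        if self.s3df_cache is None or t % self.tau == 0:
            self.s3df_cache = slow_model(rgb)    # expensive keyframe pass
        return fast_model(rgb, self.s3df_cache)  # cheap, every frame
```

with $\tau=8$, the slow model runs on 1 in 8 frames, which is where the bulk of the encoder-latency reduction comes from.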
temporal feature prediction network (TFPNet)
per-view images encoded by DINOv2-S, then 4 alternating-attention transformer blocks for cross-view aggregation, then cross-attention with historical R3DF features. distilled from frozen VGGT using cosine similarity loss:
\[\mathcal{L}_{\text{TFP}} = \sum_{i=0}^{3} \left(1 - \cos\!\left(F^{\text{S3D}}_{t_i}, F^{\text{R3D}}_{t_i}\right)\right)\]
training: 4 frames per sample, with a random interval of 1-8 timesteps between frames. only TFPNet parameters are updated (VGGT stays frozen). outputs unified tokens distilled from VGGT's 24 layers of dense output.
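the distillation objective is just a summed cosine distance. a minimal numpy sketch, assuming teacher/student feature stacks of shape `(4, n_tokens, d)` (shapes are my assumption, not from the repo):

```python
import numpy as np

def tfp_distillation_loss(s3df, r3df, eps=1e-8):
    """Sketch of L_TFP: cosine distance between frozen-VGGT teacher
    features (S3DF) and TFPNet student features (R3DF), averaged over
    tokens and summed over the 4 frames of a training sample.

    s3df, r3df: (4, n_tokens, d) arrays. Shapes are assumptions.
    """
    num = (s3df * r3df).sum(axis=-1)
    den = np.linalg.norm(s3df, axis=-1) * np.linalg.norm(r3df, axis=-1)
    cos = num / (den + eps)                        # (4, n_tokens)
    return float((1.0 - cos).mean(axis=-1).sum())  # mean tokens, sum frames
```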
multi-view feature fuser (MVFF)
each view’s image processed by ResNet encoder to 2D features. cross-attention fuses 2D features (Q) with R3DF features (K, V) per view. PRoPE (projective positional encoding) explicitly encodes camera intrinsics $K_i$ and extrinsics $T^{cw}_i$ using relative projective transformations between camera pairs. uses GTA-style block-diagonal attention with $d/8$-dim projective component + $d/4$-dim RoPE for x,y coordinates.
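the per-view fusion is standard cross-attention with 2D features as queries and R3DF tokens as keys/values. a single-head numpy sketch (names are mine; PRoPE's projective positional encoding, which in the real module modulates Q/K with camera intrinsics/extrinsics, is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mvff_fuse(feat2d, r3df, Wq, Wk, Wv):
    """Minimal single-head sketch of MVFF cross-attention for one view:
    2D ResNet features are the queries, R3DF tokens the keys/values.
    PRoPE is omitted; this only shows the Q/KV wiring.

    feat2d: (n_2d, d), r3df: (n_3d, d), W*: (d, d). All assumptions.
    """
    q, k, v = feat2d @ Wq, r3df @ Wk, r3df @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_2d, n_3d)
    return attn @ v                                  # (n_2d, d) fused
```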
training
two-stage pipeline:
stage 1 – TFPNet pre-training:
| parameter | value |
|---|---|
| GPU | 8x NVIDIA A800 |
| sequences per task | ~200,000 (from 100 trajectories) |
| sequence length | 4 frames |
| random interval | 1-8 timesteps |
| views | 2 (head, front) |
| image size | 308 × 168 |
| batch size | 12 |
| optimizer | Adam ($\beta=(0.9, 0.999)$, $\epsilon=10^{-8}$) |
| learning rate | $10^{-4}$ |
| weight decay | $10^{-4}$ |
| epochs | 100 |
| mixed precision | bfloat16 |
stage 2 – policy training:
| parameter | value |
|---|---|
| GPU | 4x NVIDIA RTX 4090 |
| optimizer | AdamW ($\beta=(0.95, 0.999)$, $\epsilon=10^{-8}$) |
| learning rate | $2 \times 10^{-5}$ |
| weight decay | $10^{-6}$ |
| LR schedule | cosine with 500 warmup steps |
| epochs | 600 |
| batch size | 8 |
| horizon | 8 |
| $n_{\text{obs\_steps}}$ | 1 |
| $n_{\text{action\_steps}}$ | 8 |
| DDIM inference steps | 10 |
| vision backbones | frozen |
evaluation
simulation – RoboTwin benchmark (100 trials/task)
| task | DP | DP3 | $\pi_0$ | R3DP ($\tau=4$) |
|---|---|---|---|---|
| block hammer beat | 0% | 49% | 47% | 77% |
| block handover | 1% | 48% | 71% | 95% |
| blocks stack easy | 6% | 26% | 79% | 69% |
| container place | 48% | 89% | 77% | 63% |
| put apple cabinet | 92% | 98% | 64% | 100% |
| tube insert | 92% | 97% | 68% | 97% |
| diverse bottles pick | 16% | 34% | 47% | 31% |
| average | 36.1% | 57.6% | 59.9% | 69.0% |
R3DP beats $\pi_0$ on 7 of the 10 RoboTwin tasks (the table above shows 7 of the 10); $\pi_0$ leads on the other 3 (blocks stack, container place, diverse bottles). R3DP average: +11.4pp over DP3, +9.1pp over $\pi_0$.
latency
| config | obs encoder (ms) | action expert (ms) |
|---|---|---|
| DP+VGGT (naive) | 73.1 | – |
| DP+VGGT+MVFF | 78.3 | – |
| R3DP ($\tau=4$) | 50.5 | ~55 |
| R3DP ($\tau=8$) | 40.3 | ~57 |
$\tau=8$: 44.8% latency reduction vs naive DP+VGGT. $\tau=8$ averages 65.7% (vs 69.0% at $\tau=4$) – only 3.3pp accuracy loss for 20% faster inference.
real-world (30 trials/task, single RTX 4090)
| task | DP | DP3 | R3DP |
|---|---|---|---|
| place shoe | 46.7% | 50.0% | 86.7% |
| place glass cup | 23.3% | 56.7% | 83.3% |
| pick peach | 20.0% | 30.0% | 50.0% |
| stack bowls | 33.3% | 56.7% | 66.7% |
| average | 30.8% | 48.4% | 71.7% |
obs encoder: 62.2ms (vs 105.1ms for DP+VGGT+MVFF = 40.8% reduction).
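both reported latency reductions follow from the same arithmetic (a trivial check, not from the repo):

```python
def latency_reduction_pct(baseline_ms, ours_ms):
    """Percent reduction in observation-encoder latency vs a baseline."""
    return 100.0 * (1.0 - ours_ms / baseline_ms)

# simulation, tau=8 vs naive DP+VGGT: 73.1 ms -> 40.3 ms, ~44.9%
# (paper reports ~44.8%)
# real-world vs DP+VGGT+MVFF: 105.1 ms -> 62.2 ms, ~40.8%
```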
reproduction guide
code at https://github.com/dazazh/R3DP.
- clone repo, install dependencies
- stage 1: pre-train TFPNet on frozen VGGT features. need ~200K 4-frame sequences from 100 demos per task. 8x A800, ~100 epochs, ~4 hours.
- stage 2: train diffusion policy head with AFSC+MVFF. 4x RTX 4090, 600 epochs, ~12 hours.
- deploy: choose $\tau$ at inference time (4 or 8 recommended). single RTX 4090 sufficient.
compute cost: ~$400 for TFPNet pretraining + ~$100 for policy training = ~$500 total per task.
gotchas: VGGT and TFPNet must be frozen during policy training (gradients only update diffusion head). TFPNet needs DINOv2-S weights. bfloat16 required for VGGT inference.
notes
the fast-slow pattern is elegant: decouple expensive 3D reasoning from every-frame policy execution. TFPNet as a temporal propagator is essentially a lightweight “3D feature cache invalidator” – it predicts how 3D features change between keyframes. the 44.8% latency reduction at $\tau=8$ with only 3.3pp accuracy loss is a strong tradeoff.
PRoPE encoding camera geometry is important – without it, multi-view fusion degrades. this connects to the broader pattern of encoding physical priors into neural architectures.
the gap to $\pi_0$ on some tasks (blocks stack, diverse bottles) suggests 3D-aware perception alone doesn’t fully close the gap with large-scale VLM priors. VLA models bring language understanding and broad pretraining that pure perception modules lack. this relates to the can-vla-models-eliminate-separate-world-models angle – R3DP is evidence that 3D perception modules are complementary to, not replacements for, VLAs.