2026-04-05
AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation
Jia Syuen Lim, Zhizhen Zhang, Peter Böhm, Brendan Tidd, Zi Huang, Yadan Luo et al.
problem
Mobile manipulation requires coordinating a non-holonomic mobile base with a high-DOF manipulator in cluttered, dynamically evolving environments. The action distribution $p(A_t \mid o_t, \ell)$ is inherently multimodal — a single instruction admits multiple valid approach routes, base poses, and grasp trajectories. Existing approaches face a trilemma:
- Deterministic VLAs (OpenVLA, VLA-Adapter, SmolVLA) use L1 regression heads that collapse modes toward a conditional average, producing smooth but physically inexecutable trajectories (mode averaging). This failure is especially severe in mobile manipulation, where multiple distinct solutions exist, unlike tabletop settings where actions concentrate around a narrow set.
- Full diffusion policies (Diffusion Policy, AC-DiT, RDT-1B) model multimodality faithfully but require many iterative denoising steps (e.g., 50–100), creating an inference bottleneck incompatible with real-time mobile-manipulation control. AC-DiT, the strongest mobile-manipulation baseline at 55.6% on MS-HAB, relies on expensive point-cloud inputs and full diffusion sampling.
- Action chunking amortizes inference cost but introduces partially open-loop execution: as perception evolves, precomputed actions become misaligned, and small kinematic errors compound into drift. This is especially problematic when navigation and manipulation are tightly coupled.
AnchorVLA simultaneously addresses all three: it preserves multimodal action generation via anchored diffusion, truncates denoising to ~5–10 steps for low-latency inference, and mitigates chunking-induced drift with a lightweight residual correction module.
architecture
AnchorVLA has three components: (1) a LoRA-fine-tuned VLA backbone, (2) an anchored truncated diffusion action head with DiT conditioning, and (3) a residual correction MLP for test-time self-correction.
VLA Backbone
Built on the VLA-Adapter architecture using Qwen-2.5 (0.5B) as the frozen backbone with LoRA (rank 64). At each timestep $t$, given observation $o_t = \{I_t^g, I_t^w, s_t\}$ (third-person image, wrist image, proprioceptive state) and instruction $\ell$, the backbone extracts a compact latent context:
\[x\_t = f\_{\text{VLA}}(o\_t, \ell)\]
The backbone processes 2 RGB images and a dense proprioceptive vector (base pose, linear/angular velocities, joint positions/velocities, end-effector pose, gripper status, target object/goal pose). The output is a unified 13-DoF continuous action command jointly controlling base velocities and arm joint targets.
Anchored Truncated Diffusion Action Head
Trajectory vocabulary. A set of $M=20$ precomputed anchor trajectories $\mathcal{V} = \{\bar{A}^{(m)}\}_{m=1}^{M}$, where each $\bar{A}^{(m)} \in \mathbb{R}^{H \times d}$, is constructed by K-Means clustering on normalized expert action chunks. Actions are normalized to $[-1, 1]$ before clustering to prevent scale imbalance across DOFs.
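A minimal sketch of the vocabulary construction, using a plain-numpy Lloyd's K-Means; the paper does not specify the clustering implementation, and the function and variable names here are illustrative:

```python
import numpy as np

def build_anchor_vocabulary(chunks, M=20, iters=50, seed=0):
    """Cluster normalized expert action chunks (N, H, d) into M anchor
    trajectories (M, H, d) with a minimal Lloyd's K-Means (numpy only)."""
    N, H, d = chunks.shape
    X = chunks.reshape(N, -1)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(N, size=M, replace=False)].copy()
    for _ in range(iters):
        # assign every chunk to its nearest center, then recompute the means
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    return centers.reshape(M, H, d)

# toy usage: 500 synthetic chunks, horizon H=5, 13-DoF actions in [-1, 1]
rng = np.random.default_rng(0)
chunks = np.clip(rng.normal(size=(500, 5, 13)), -1.0, 1.0)
anchors = build_anchor_vocabulary(chunks, M=20)
print(anchors.shape)  # (20, 5, 13)
```

Flattening each chunk to a single $H \cdot d$ vector before clustering means an anchor is a whole trajectory shape, not a per-step prototype.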
Truncated forward process. Instead of sampling from pure Gaussian noise over the full schedule of $S$ steps, denoising is restricted to a truncated range $S_{\text{tr}} = \max\{1, \lfloor \rho S \rceil\}$ with ratio $\rho \in (0, 1]$. The noisy action is anchored:
\[A\_t^{(\tau, m)} = \sqrt{\bar{\alpha}\_\tau}\,\bar{A}^{(m)} + \sqrt{1 - \bar{\alpha}\_\tau}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)\]
During training, $\tau$ is sampled from $\mathcal{U}\{0, \dots, S_{\text{tr}}-1\}$ with $S_{\text{tr}} = 10$; at inference, denoising is initialized at the truncation boundary $\tau_{\text{start}} = S_{\text{tr}} - 1$.
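The anchored forward process can be sketched directly; the cosine schedule and the toy shapes below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def anchored_noise(anchor, tau, alpha_bar, rng):
    """Truncated forward process: diffuse an anchor trajectory to noise
    level tau instead of starting from pure Gaussian noise."""
    eps = rng.normal(size=anchor.shape)
    a = alpha_bar[tau]
    return np.sqrt(a) * anchor + np.sqrt(1.0 - a) * eps, eps

# cosine-style schedule over S=50 steps (the schedule itself is an assumption)
S, S_tr = 50, 10
alpha_bar = np.cos(0.5 * np.pi * np.arange(S) / S) ** 2

rng = np.random.default_rng(0)
anchor = np.zeros((2, 13))        # toy H=2 chunk of 13-DoF actions
tau_start = S_tr - 1              # inference starts at the truncation boundary
noisy, eps = anchored_noise(anchor, tau_start, alpha_bar, rng)
print(noisy.shape)  # (2, 13)
```

Because $\tau_{\text{start}} \ll S$, $\bar{\alpha}_{\tau_{\text{start}}}$ stays close to 1 and the sample remains near the anchor rather than near pure noise.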
DiT denoiser. A DiT-style network conditions on task/context tokens, action-query tokens from the VLA backbone, and proprioceptive embeddings via bridge-style KV conditioning and AdaLN modulation:
\[\{\hat{\epsilon}^{(m)}\}\_{m=1}^{M} = f\_\phi\!\left(A\_t^{(\tau, m)},\, \text{emb}(\tau),\, x\_t\right)\]
The denoised chunk is reconstructed via standard scheduler inversion:
\[\hat{A}\_t^{(m)} = g\!\left(A\_t^{(\tau, m)}, \hat{\epsilon}^{(m)}, \tau\right)\]
Anchor scoring head. A small head predicts a scalar confidence per anchor:
\[s^{(m)} = h\!\left(\hat{A}\_t^{(m)}, x\_t, \tau\right)\]
The final action chunk selects the highest-scoring trajectory: $\hat{m} = \arg\max_m s^{(m)}$.
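Putting the pieces together, a single-step version of anchored inference might look like the following. `denoiser` and `scorer` are stand-ins for the DiT head $f_\phi$ and scoring head $h$, the DDPM $x_0$ formula is one possible instantiation of the scheduler inversion $g$, and the real pipeline iterates from $\tau_{\text{start}}$ down to 0 rather than denoising once:

```python
import numpy as np

def reconstruct_x0(noisy, eps_hat, a_bar):
    """Scheduler inversion g: recover the clean chunk from predicted noise
    (the standard DDPM x0 formula, one possible choice of g)."""
    return (noisy - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)

def select_chunk(anchors, alpha_bar, tau, denoiser, scorer, rng):
    """Noise every anchor to level tau, denoise, score each candidate,
    and return the argmax chunk."""
    candidates, scores = [], []
    for anchor in anchors:
        eps = rng.normal(size=anchor.shape)
        noisy = np.sqrt(alpha_bar[tau]) * anchor + np.sqrt(1 - alpha_bar[tau]) * eps
        eps_hat = denoiser(noisy, tau)        # stands in for the DiT head f_phi
        chunk = reconstruct_x0(noisy, eps_hat, alpha_bar[tau])
        candidates.append(chunk)
        scores.append(scorer(chunk, tau))     # stands in for the scoring head h
    m_hat = int(np.argmax(scores))
    return candidates[m_hat], m_hat

# toy run: M=20 anchors, H=2 chunk, 13-DoF actions, tau at the boundary
rng = np.random.default_rng(0)
anchors = rng.uniform(-1, 1, size=(20, 2, 13))
alpha_bar = np.cos(0.5 * np.pi * np.arange(50) / 50) ** 2
denoiser = lambda noisy, tau: np.zeros_like(noisy)   # toy: predict zero noise
scorer = lambda chunk, tau: -np.abs(chunk).max()     # toy: prefer small actions
chunk, m_hat = select_chunk(anchors, alpha_bar, 9, denoiser, scorer, rng)
print(chunk.shape, m_hat)
```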
Residual Correction Module
A lightweight MLP (57K parameters) predicts per-step micro-adjustments during chunk execution to counteract open-loop drift:
\[a\_{t+j} = \hat{a}\_{t+j} + \Delta a\_{t+j}, \qquad \Delta a\_{t+j} = r\_\psi\!\left(o\_{t+j}, \ell, \hat{a}\_{t+j}, j\right)\]
The intra-chunk step index $j \in \{0, \dots, H-1\}$ is encoded via a 1D sinusoidal positional embedding. The module has no dependence on execution history; it relies on instantaneous state mismatch alone. It is trained separately, after the main backbone and diffusion head are frozen.
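A sketch of the residual module under the stated design; the two-layer structure, hidden width, and feature sizes are assumptions, while the sinusoidal step embedding and the ~57K parameter scale come from the paper:

```python
import numpy as np

def sinusoidal_embedding(j, dim=16):
    """1D sinusoidal positional embedding of the intra-chunk step index j."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(j * freqs), np.cos(j * freqs)])

def residual_correction(obs_feat, a_hat, j, W1, b1, W2, b2):
    """Two-layer MLP r_psi predicting the per-step micro-adjustment delta a.
    Hidden width and feature sizes are illustrative, not from the paper."""
    x = np.concatenate([obs_feat, a_hat, sinusoidal_embedding(j)])
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2

# toy sizes: 32-dim observation features, 13-DoF action, 64 hidden units
rng = np.random.default_rng(0)
obs_feat, a_hat = rng.normal(size=32), rng.normal(size=13)
in_dim = 32 + 13 + 16
W1, b1 = 0.01 * rng.normal(size=(64, in_dim)), np.zeros(64)
W2, b2 = 0.01 * rng.normal(size=(13, 64)), np.zeros(13)
delta = residual_correction(obs_feat, a_hat, j=3, W1=W1, b1=b1, W2=W2, b2=b2)
a_corrected = a_hat + delta            # a_{t+j} = a_hat_{t+j} + delta a_{t+j}
print(a_corrected.shape)  # (13,)
```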
Total parameters: 726.19M (base) + 57K (residual) = 726.25M.
training
Simulation (ManiSkill-HAB)
- Dataset: 1,000 successful RL-generated expert rollouts per task (6 tasks: Pick Apple, Place Apple, Pick Bowl, Place Bowl, Open Fridge, Close Kitchen Counter), from MS-HAB benchmark with SAPIEN physics engine, ReplicaCAD environments, and YCB objects
- Backbone: Qwen-2.5 (0.5B), frozen except LoRA adapters (rank 64)
- Batch size: 16
- Learning rate: $2 \times 10^{-4}$
- Gradient steps: 100K
- Loss (diffusion head): joint reconstruction + anchor classification, combining an L1 reconstruction term on the positive anchor's denoised chunk with a cross-entropy term on the anchor scores, where $m^\star = \arg\min_m \|A_t^\star - \bar{A}^{(m)}\|_1$ selects the nearest anchor to the ground truth as the positive sample
- Loss (residual module): L1 regression on corrective adjustments, trained after the backbone and diffusion head are frozen
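A hedged sketch of the diffusion-head objective, assuming a simple weighted sum of the L1 reconstruction and cross-entropy terms (the actual weighting is not specified in the notes above):

```python
import numpy as np

def diffusion_head_loss(A_star, A_hat, scores, anchors, lam=1.0):
    """Sketch of the joint objective: L1 reconstruction on the positive
    anchor's denoised chunk plus softmax cross-entropy on the anchor
    scores. The weighting lam is an assumption, not from the paper."""
    # positive anchor: nearest vocabulary entry to the ground-truth chunk
    m_star = int(np.argmin([np.abs(A_star - a).sum() for a in anchors]))
    recon = np.abs(A_hat[m_star] - A_star).mean()
    z = scores - scores.max()                  # numerically stable softmax
    ce = np.log(np.exp(z).sum()) - z[m_star]
    return recon + lam * ce, m_star

# toy shapes: M=20 anchors, H=2 chunk, 13-DoF actions
rng = np.random.default_rng(0)
anchors = rng.uniform(-1, 1, size=(20, 2, 13))
A_star = rng.uniform(-1, 1, size=(2, 13))                 # ground-truth chunk
A_hat = anchors + 0.05 * rng.normal(size=anchors.shape)   # denoised candidates
scores = rng.normal(size=20)                              # raw anchor logits
loss, m_star = diffusion_head_loss(A_star, A_hat, scores, anchors)
print(float(loss) >= 0.0, 0 <= m_star < 20)
```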
Real World
- Platform: Unitree Go2 Edu quadruped + SO101 arm
- Dataset: 50 teleoperated demonstrations per task (Open Drawer, Lift Bag) collected via Meta Quest 3 VR at 15 Hz control frequency
- Robot start position: Randomized ~1.5m from target object
- Same hyperparameters as simulation
evaluation
ManiSkill-HAB (Simulation)
6 tasks evaluated over 3 independent trials × 100 episodes each, max 200 timesteps per episode. All baselines use $H=2$ except AnchorVLA which tests both $H=2$ and $H=5$:
| Method | Input | Pick Apple | Pick Bowl | Place Apple | Place Bowl | Open Fridge | Close Counter | Average |
|---|---|---|---|---|---|---|---|---|
| $\pi_0$ | RGB | 13.0 | 15.7 | 23.3 | 21.3 | 31.3 | 70.0 | 33.5 |
| ACT | RGB | 28.0 | 28.0 | 8.7 | 13.0 | 2.0 | 85.7 | 23.6 |
| RDT | RGB | 12.0 | 10.7 | 32.0 | 18.7 | 82.7 | 100.0 | 42.9 |
| DP | RGB | 21.3 | 20.7 | 28.0 | 69.3 | 7.3 | 55.0 | 28.8 |
| DP3 | PointCloud | 0.0 | 20.0 | 31.0 | 32.0 | 0.0 | 68.0 | 21.6 |
| AC-DiT | PointCloud | 33.3 | 36.0 | 33.3 | 17.3 | 90.7 | 97.3 | 55.6 |
| AnchorVLA ($H=2$) | RGB | 22.7 | 44.5 | 64.3 | 63.8 | 88.9 | 100.0 | 64.0 |
| AnchorVLA ($H=5$) | RGB | 16.1 | 42.9 | 57.7 | 69.6 | 82.5 | 100.0 | 61.5 |
AnchorVLA ($H=2$) achieves a state-of-the-art 64.0% average, outperforming AC-DiT by +8.4% absolute despite using only RGB inputs rather than point clouds. The $H=5$ variant maintains 61.5%, rivaling baselines that re-plan 2.5× more frequently.
Real World (Unitree Go2 + SO101)
| Method | Open Drawer | Lift Bag | Average |
|---|---|---|---|
| SmolVLA-0.5B | 20% | 10% | 15% |
| AnchorVLA | 40% | 40% | 40% |
10 trials per task with randomized initial positions. AnchorVLA achieves +25% absolute improvement over SmolVLA at comparable parameter scale.
Inference Efficiency
- Per-episode compute: 641.4 TFLOP ($H=1$) → 128.3 TFLOP ($H=5$), an ~80% reduction
- Inference frequency at $H=5$: 60.1 Hz without the residual module, 53.6 Hz with it (only ~11% overhead for +9.3% success)
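The reported efficiency figures check out arithmetically:

```python
# Arithmetic behind the reported efficiency figures
tflop_h1, tflop_h5 = 641.4, 128.3
print(f"compute reduction: {1 - tflop_h5 / tflop_h1:.1%}")   # ~80%

hz_without, hz_with = 60.1, 53.6
print(f"residual overhead: {1 - hz_with / hz_without:.1%}")  # ~11%
```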
Key Ablations (Pick Bowl task)
Anchored prior necessity:
| Method | Denoising Steps | Success (%) | Freq (Hz) |
|---|---|---|---|
| w/o Anchor | 50 | 12.8 | 60.6 |
| w/o Anchor | 10 | 0.0 | 108.0 |
| w/ Anchor | 10 | 42.9 | 53.6 |
| w/ Anchor | 5 | 12.6 | 89.8 |
Anchored diffusion at 10 steps achieves 42.9% where standard diffusion fails completely (0.0%). At equivalent ~12% success, anchors reduce denoising from 50→5 steps (1.48× speedup).
Residual correction: without the residual at $H=5$: 33.6% → with it: 42.9% (+9.3% absolute) for only +57K parameters.
reproduction guide
Environment Setup
```bash
# Clone the repository
git clone https://github.com/jason-lim26/AnchorVLA.git
cd AnchorVLA
pip install -e .
```
Simulation Training (ManiSkill-HAB)
```bash
# Install ManiSkill-HAB benchmark dependencies
pip install mani-skill-hab

# Train AnchorVLA on a single task (e.g., Pick Bowl)
python train.py \
    --task settable_pick_bowl \
    --backbone qwen2.5-0.5b \
    --lora_rank 64 \
    --batch_size 16 \
    --lr 2e-4 \
    --num_anchors 20 \
    --truncated_steps 10 \
    --gradient_steps 100000 \
    --use_proprioception \
    --num_images 2
```
Inference
```bash
# Evaluate trained model
python eval.py \
    --task settable_pick_bowl \
    --checkpoint checkpoints/anchorvla_pick_bowl.pt \
    --chunk_size 2 \
    --num_episodes 100 \
    --num_trials 3
```
Real-World Deployment
```bash
# Train on teleoperation data (50 demos per task)
python train.py \
    --task real_world \
    --backbone qwen2.5-0.5b \
    --lora_rank 64 \
    --batch_size 16 \
    --lr 2e-4 \
    --gradient_steps 100000

# Deploy on Unitree Go2 + SO101
python deploy.py \
    --checkpoint checkpoints/anchorvla_realworld.pt \
    --robot unitree_go2_so101 \
    --control_freq 15
```
Key Configuration Notes
- Anchor vocabulary is built offline via K-Means (M=20) on normalized expert action chunks before training
- Residual MLP is trained in a second stage after freezing the backbone + diffusion head
- Action normalization to $[-1, 1]$ is applied before clustering and training
- For $H=5$ deployment, use `--chunk_size 5` for maximum compute savings (~80% reduction)
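The action-normalization step can be sketched as follows, assuming standard per-dimension min-max statistics (the exact statistics the authors use are not specified):

```python
import numpy as np

def fit_normalizer(actions):
    """Per-dimension min/max over a dataset of chunks (N, H, d)."""
    return actions.min(axis=(0, 1)), actions.max(axis=(0, 1))

def normalize(actions, lo, hi):
    """Map raw actions to [-1, 1] per dimension before clustering/training."""
    return 2.0 * (actions - lo) / (hi - lo) - 1.0

def denormalize(actions, lo, hi):
    """Invert the mapping when executing predicted actions on the robot."""
    return 0.5 * (actions + 1.0) * (hi - lo) + lo

# toy dataset: 100 chunks, horizon 5, 13-DoF actions
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 5, 13))
lo, hi = fit_normalizer(raw)
norm = normalize(raw, lo, hi)
print(norm.min(), norm.max())  # -1.0 1.0
```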
notes
- First application of anchored truncated diffusion to robot-manipulation action generation. Prior work used the technique only in autonomous driving (DiffusionDrive) and sparse keypose planning (AnchorDP3); AnchorVLA applies it to dense, whole-body trajectory chunks for mobile manipulation.
- Anchors as structured priors resolve the mode-averaging/efficiency trade-off. By initializing from clustered expert trajectories rather than Gaussian noise, the model stays on the demonstration manifold with far fewer denoising steps, and the scoring head adds a selection mechanism that standard diffusion policies lack.
- The residual module is conceptually related to exposure-bias mitigation. Rather than irrevocably committing to predicted actions, the policy observes the realized rollout and applies corrective updates, similar in spirit to scheduled sampling or test-time adaptation but implemented as a separate lightweight network.
- RGB-only input is competitive with point-cloud methods. AnchorVLA's 64.0% with pure RGB vs. AC-DiT's 55.6% with explicit 3D geometry suggests that the generative action head compensates for weaker spatial perception, an interesting direction for VLA design.
- Graceful degradation at longer horizons. The $H=5$ variant (61.5%) nearly matches $H=2$ (64.0%), enabling an ~80% reduction in per-episode compute; by contrast, the L1 baseline collapses from 42.8% to 8.8% over the same range.
- Limitations: performance is bounded by demonstration coverage (offline IL only); the authors note RL-based test-time correction as future work. The 13-DoF unified action space may not scale to more complex manipulators, and real-world evaluation is limited to 2 tasks with 10 trials each.
- Connection to scaling VLAs: the architecture is compatible with larger backbones (Qwen-2.5 at 0.5B is modest), and the anchor + scoring mechanism could complement larger-scale diffusion VLAs like $\pi_0$ to improve their inference efficiency without retraining the full model.