2026-03-30

Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li et al.

VLA discrete-diffusion real-time-robotics KV-cache

problem

discrete diffusion VLA (dVLA) models like Dream-VLA, DD-VLA, and UD-VLA output actions via iterative denoising of discrete tokens, giving them better multimodal alignment than flow-matching VLAs but at a severe inference-speed penalty. the bidirectional attention mechanism in dVLAs prevents KV-cache reuse between denoising steps, so every step must recompute attention over the full sequence. typical dVLA execution frequencies fall far below the ~30 Hz required for real-time robot control. prior acceleration strategies like Fast-dLLM (which forces KV reuse under bidirectional attention) cause performance drops because the cached KV states go stale. block diffusion (sequential blocks with intra-block parallelism) helps but precludes inter-block parallelism. Fast-dVLA is the first method to bring dVLAs to real-time speed (30 Hz) while maintaining or improving task success rates.

architecture

Fast-dVLA makes two key architectural changes to the standard dVLA:

block-wise causal attention. instead of full bidirectional attention over the entire action sequence, Fast-dVLA restricts attention so each block only attends to its prefix (prompt + earlier completed blocks) and tokens within itself. once a block is fully decoded, its KV states remain fixed and can be cached for all subsequent denoising iterations. this is the critical enabler for inference speedup.
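a minimal numpy sketch of what such a mask looks like (prefix length, block count, and block size are illustrative parameters; this is my reconstruction, not the paper's code):

```python
import numpy as np

def block_causal_mask(prefix_len: int, num_blocks: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: True = may attend.

    Each action block attends to the prompt prefix, to all earlier
    (completed) blocks, and to tokens within itself -- never to later
    blocks. Once a block is decoded, its KV states are therefore fixed.
    """
    total = prefix_len + num_blocks * block_size
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :prefix_len] = True  # every token sees the prompt
    for i in range(num_blocks):
        start = prefix_len + i * block_size
        end = start + block_size
        # block i attends to blocks 0..i (inclusive of itself)
        mask[start:end, prefix_len:end] = True
    return mask
```

because the mask is lower-block-triangular over blocks, a completed block's keys/values never change in later denoising iterations, which is exactly what makes caching sound here (unlike forced KV reuse under full bidirectional attention).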

diffusion forcing for inter-block parallelism. blocks are assigned progressively increasing noise levels: $t_1 < t_2 < \cdots < t_N$. earlier blocks see less corruption and decode faster, while later blocks remain more masked. the reverse process factorizes as:

\[p\_\theta(Y^0 \mid Y^{t\_{1:N}}) = \prod\_{i=1}^{N} p\_\theta(Y\_{B\_i}^0 \mid Y\_{B\_1}^{t\_1}, \ldots, Y\_{B\_i}^{t\_i})\]

this means blocks can be denoised in parallel even though they carry different noise levels, since later blocks only need to condition on partially-cleaned earlier blocks (via block-wise causal attention).
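the per-block corruption can be sketched as follows (numpy; function names and the linear level schedule are illustrative, not the released implementation):

```python
import numpy as np

def assign_noise_levels(num_blocks: int, t_min: float = 0.2, t_max: float = 1.0):
    """Monotonically increasing per-block noise levels t_1 < ... < t_N."""
    return np.linspace(t_min, t_max, num_blocks)

def corrupt_blocks(tokens: np.ndarray, levels: np.ndarray, mask_id: int, seed: int = 0):
    """Mask roughly a fraction t_i of block i's tokens (diffusion forcing).

    tokens has shape (num_blocks, block_size); later blocks carry higher
    noise levels and therefore contain more mask tokens.
    """
    rng = np.random.default_rng(seed)
    out = tokens.copy()
    for i, t in enumerate(levels):
        masked = rng.random(tokens.shape[1]) < t  # mask each token w.p. t_i
        out[i, masked] = mask_id
    return out
```

earlier blocks (low $t_i$) are nearly clean and finish first, which is what lets later blocks condition on them mid-decode.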

pipelined parallel decoding at inference. blocks transition through three states: un-activated, semi-activated, fully-activated. a new block enters the semi-activated state once the previous block's completion ratio exceeds $\tau_{\text{add}} = 0.5$, and becomes fully-activated once it exceeds $\tau_{\text{act}} = 0.7$. semi-activated blocks use confidence-guided decoding with threshold $\tau_{\text{conf}} = 0.5$, while fully-activated blocks use a logarithmic ($\log_2$) schedule: at least half of the still-masked tokens are decoded each step, so a block of $n$ tokens finishes in $O(\log_2 n)$ steps. this pipeline allows multiple blocks to be processed concurrently.
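a toy sketch of the state-transition rule and the $\log_2$ schedule (function names are mine; thresholds are the paper's defaults):

```python
import math

def pipeline_states(ratios, tau_add=0.5, tau_act=0.7):
    """Map each block's completion ratio to a decoding state.

    A block stays un-activated until the *previous* block's completion
    ratio exceeds tau_add, is semi-activated until it exceeds tau_act,
    and is fully-activated after that. The first block has no
    predecessor and is treated as fully-activated.
    """
    states = []
    for i, _ in enumerate(ratios):
        prev = 1.0 if i == 0 else ratios[i - 1]
        if prev <= tau_add:
            states.append("un-activated")
        elif prev <= tau_act:
            states.append("semi-activated")
        else:
            states.append("fully-activated")
    return states

def tokens_this_step(remaining: int) -> int:
    """log2 schedule: decode at least half of the still-masked tokens."""
    return max(1, math.ceil(remaining / 2))
```

with `ratios = [0.9, 0.6, 0.2]`, block 2 is semi-activated (its predecessor is at 0.6, between $\tau_{\text{add}}$ and $\tau_{\text{act}}$) while blocks 0 and 1 are fully active, i.e. three blocks are in flight at once.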

training: asymmetric distillation. rather than training from scratch, Fast-dVLA distills from an existing finetuned bidirectional dVLA teacher using LoRA (rank 32). the distillation loss is:

\[\mathcal{L}\_{\text{AD}} = \mathbb{E}\left[\sum\_{i=1}^{N} D\_{\text{KL}}\left(p\_\theta(Y\_{B\_i}^0 \mid Y\_{B\_{<i}}^{t\_{<i}}, c) \,\|\, p\_{\phi^-}(Y\_{B\_i}^0 \mid Y\_{B\_{\leq N}}^{t\_{\leq N}}, c)\right)\right]\]

the teacher sees all blocks (bidirectional), the student only sees completed prior blocks (causal). during distillation, LoRA is disabled for teacher logits and enabled for student logits, preserving the pretrained backbone’s visual-language priors.
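a numpy sketch of the KL term in $\mathcal{L}_{\text{AD}}$ (the real loss operates on per-block logits from the causal student and the bidirectional teacher; shapes and names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_distill_loss(student_logits, teacher_logits):
    """KL(student || teacher), averaged over token positions.

    student_logits: causal student (LoRA enabled, conditions only on
    completed prior blocks). teacher_logits: frozen bidirectional
    teacher (LoRA disabled, sees all blocks). Shapes: (tokens, vocab).
    """
    p = softmax(student_logits)   # p_theta
    q = softmax(teacher_logits)   # p_{phi^-}
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)
    return kl.mean()
```

the asymmetry is in the conditioning, not the loss: both sides score the same block, but only the teacher gets the full (bidirectional) context.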

training

  • distillation steps: 4k for Dream-VLA (1/5 of original finetuning), 4k for DD-VLA (1/8 of original), 3k for UD-VLA (1/8 of original)
  • batch size: 12 for UD-VLA, follows original configs for others
  • block size: 7 (matching action dimensionality of 7 DoF) for Dream-VLA and DD-VLA; multiple of 32 for UD-VLA (which has 625-token sequences)
  • convergence: asymmetric distillation converges in ~2000 steps (5x faster than training from finetuned weights, 10x faster than from scratch)
  • LoRA rank: 32
  • hardware: not explicitly stated, but standard GPU training implied

evaluation

LIBERO benchmark (Dream-VLA base):

| method | spatial | goal | object | long | avg | speed |
|---|---|---|---|---|---|---|
| Dream-VLA baseline | 90.2 | 92.0 | 88.0 | 72.0 | 85.6 | 98.8 tok/s (1.0x) |
| + Fast-dLLM | 88.4 | 89.4 | 83.4 | 70.2 | 82.8 | 183.2 (1.9x) |
| + Block Diffusion | 91.8 | 90.4 | 88.6 | 72.2 | 85.8 | 181.7 (1.8x) |
| + Fast-dVLA (ours) | 91.2 | 92.0 | 90.2 | 74.6 | 87.0 | 313.1 (3.2x) |

LIBERO benchmark (DD-VLA base):

| method | spatial | goal | object | long | avg | speed |
|---|---|---|---|---|---|---|
| DD-VLA baseline | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 | 152.1 tok/s (1.0x) |
| + Fast-dLLM | 94.0 | 95.2 | 94.8 | 89.8 | 93.5 | 312.5 (3.2x) |
| + Block Diffusion | 97.6 | 98.6 | 97.2 | 93.2 | 96.7 | 322.1 (3.3x) |
| + Fast-dVLA (ours) | 97.0 | 98.8 | 97.6 | 92.8 | 96.6 | 402.7 (4.1x) |

CALVIN ABCD->D (UD-VLA base): 2.8x speedup (67.3 to 186.7 tok/s), avg length 4.54 vs 4.64 baseline. competitive with world-modeling VLAs like UP-VLA and MDT.

SimplerEnv (real-to-sim): highest decoding speed among all discrete-output VLAs. 59.3% avg task success, outperforming continuous flow-matching methods like $\pi_0$ (32.1%) and GR00T-N1 (36.5%).

real-world (bimanual AgileX, 6-DOF arms): consistent 30 Hz execution frequency across all tasks. nearly 2x efficiency of prior methods on conveyor picking. competitive success rates with faster completion times.

reproduction guide

  1. prerequisites: existing finetuned dVLA model (Dream-VLA, DD-VLA, or UD-VLA), PyTorch, GPU with sufficient memory
  2. setup: clone the project page repo (https://chris1220313648.github.io/Fast-dVLA/). configure LoRA rank to 32
  3. distillation: run asymmetric distillation for 4k steps (Dream-VLA/DD-VLA) or 3k steps (UD-VLA) with block size 7 (or 32+ for UD-VLA)
  4. inference: set $\tau_{\text{conf}} = 0.5$, $\tau_{\text{add}} = 0.5$, $\tau_{\text{act}} = 0.7$, logarithmic factor $\log_2$
  5. expected result: 2.8-4.1x speedup with maintained or improved success rates
  6. gotchas: block size should be a multiple of action dimensionality for best performance. confidence threshold $\tau_{\text{conf}}$ is the main speed/quality knob. the LoRA branches must be disabled for teacher, enabled for student during distillation
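the hyperparameters above, collected into one hypothetical config dict (key names are mine, not from any released code):

```python
# hypothetical config sketch for a Fast-dVLA distillation/inference run;
# values come from the paper, key names are illustrative
FAST_DVLA_CFG = {
    "lora_rank": 32,
    "block_size": 7,        # match the 7-DoF action dim (32+ for UD-VLA)
    "distill_steps": 4000,  # 3000 for UD-VLA
    "tau_conf": 0.5,        # confidence threshold: main speed/quality knob
    "tau_add": 0.5,         # prev-block ratio to semi-activate the next block
    "tau_act": 0.7,         # prev-block ratio to fully activate it
}
```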

notes

  • the key insight is that dVLAs already exhibit an implicit left-to-right decoding pattern despite bidirectional attention, making block-wise causal attention a natural fit rather than a disruptive change
  • asymmetric distillation is remarkably efficient: 2000 steps to convergence from a finetuned teacher, which is 10x cheaper than training from scratch
  • this method is complementary to other VLA efficiency techniques (quantization, pruning, early exit) and could stack with them
  • the 30 Hz real-time threshold is met consistently, which is the practical requirement for physical robot control
  • for bopi’s use case: if using dVLA for robot control, Fast-dVLA is essential for real-time performance. the distillation cost is low (4k steps) and the speedup is substantial (up to 4.1x)
  • connects to the VLA-vs-world-model question: Fast-dVLA makes dVLAs competitive with flow-matching VLAs in speed, so the choice between them can be based on representational quality rather than inference speed