2026-03-30

Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li et al.

VLA discrete-diffusion real-time-robotics KV-cache

problem

discrete diffusion VLA (dVLA) models like Dream-VLA, DD-VLA, and UD-VLA output actions via iterative denoising of discrete tokens, giving them better multimodal alignment than flow-matching VLAs but at a severe inference-speed penalty. the bidirectional attention mechanism in dVLAs prevents KV-cache reuse between denoising steps, so every step must recompute attention over the full sequence. typical dVLA execution frequencies fall far below the ~30 Hz required for real-time robot control. prior acceleration strategies like Fast-dLLM (which forces KV reuse under bidirectional attention) cause performance drops because the cached KV states go stale. block diffusion (sequential blocks with intra-block parallelism) helps but precludes inter-block parallelism. Fast-dVLA is the first method to bring dVLAs to real-time speed (30 Hz) while maintaining or improving task success rates.

architecture

Fast-dVLA makes two key architectural changes to the standard dVLA:

block-wise causal attention. instead of full bidirectional attention over the entire action sequence, Fast-dVLA restricts attention so each block only attends to its prefix (prompt + earlier completed blocks) and tokens within itself. once a block is fully decoded, its KV states remain fixed and can be cached for all subsequent denoising iterations. this is the critical enabler for inference speedup.
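a minimal numpy sketch of what such a mask looks like (prefix length, block count, and block size are illustrative parameters; this is my reconstruction, not the paper's code):

```python
import numpy as np

def block_causal_mask(prefix_len: int, num_blocks: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: True = may attend.

    Each action block attends to the prompt prefix, to all earlier
    (completed) blocks, and to tokens within itself -- never to later
    blocks. Once a block is decoded, its KV states are therefore fixed.
    """
    total = prefix_len + num_blocks * block_size
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :prefix_len] = True  # every token sees the prompt
    for i in range(num_blocks):
        start = prefix_len + i * block_size
        end = start + block_size
        # block i attends to blocks 0..i (inclusive of itself)
        mask[start:end, prefix_len:end] = True
    return mask
```

because the mask is lower-block-triangular over blocks, a completed block's keys/values never change in later denoising iterations, which is exactly what makes caching sound here (unlike forced KV reuse under full bidirectional attention).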

diffusion forcing for inter-block parallelism. blocks are assigned progressively increasing noise levels: $t_1 < t_2 < \cdots < t_N$. earlier blocks see less corruption and decode faster, while later blocks remain more masked. the reverse process factorizes as:

\[p\_\theta(Y^0 \mid Y^{t\_{1:N}}) = \prod\_{i=1}^{N} p\_\theta(Y\_{B\_i}^0 \mid Y\_{B\_1}^{t\_1}, \ldots, Y\_{B\_i}^{t\_i})\]

this means blocks can be denoised in parallel even though they carry different noise levels, since later blocks only need to condition on partially-cleaned earlier blocks (via block-wise causal attention).
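the per-block corruption can be sketched as follows (numpy; function names and the linear level schedule are illustrative, not the released implementation):

```python
import numpy as np

def assign_noise_levels(num_blocks: int, t_min: float = 0.2, t_max: float = 1.0):
    """Monotonically increasing per-block noise levels t_1 < ... < t_N."""
    return np.linspace(t_min, t_max, num_blocks)

def corrupt_blocks(tokens: np.ndarray, levels: np.ndarray, mask_id: int, seed: int = 0):
    """Mask roughly a fraction t_i of block i's tokens (diffusion forcing).

    tokens has shape (num_blocks, block_size); later blocks carry higher
    noise levels and therefore contain more mask tokens.
    """
    rng = np.random.default_rng(seed)
    out = tokens.copy()
    for i, t in enumerate(levels):
        masked = rng.random(tokens.shape[1]) < t  # mask each token w.p. t_i
        out[i, masked] = mask_id
    return out
```

earlier blocks (low $t_i$) are nearly clean and finish first, which is what lets later blocks condition on them mid-decode.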

pipelined parallel decoding at inference. blocks transition through three states: un-activated, semi-activated, fully-activated. a new block enters the semi-activated state once the previous block's completion ratio exceeds $\tau_{\text{add}} = 0.5$, and becomes fully-activated once it exceeds $\tau_{\text{act}} = 0.7$. semi-activated blocks use confidence-guided decoding with threshold $\tau_{\text{conf}} = 0.5$, while fully-activated blocks use a logarithmic ($\log_2$) schedule: at least half of the still-masked tokens are decoded each step, so a block of $n$ tokens finishes in $O(\log_2 n)$ steps. this pipeline allows multiple blocks to be processed concurrently.
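a toy sketch of the state-transition rule and the $\log_2$ schedule (function names are mine; thresholds are the paper's defaults):

```python
import math

def pipeline_states(ratios, tau_add=0.5, tau_act=0.7):
    """Map each block's completion ratio to a decoding state.

    A block stays un-activated until the *previous* block's completion
    ratio exceeds tau_add, is semi-activated until it exceeds tau_act,
    and is fully-activated after that. The first block has no
    predecessor and is treated as fully-activated.
    """
    states = []
    for i, _ in enumerate(ratios):
        prev = 1.0 if i == 0 else ratios[i - 1]
        if prev <= tau_add:
            states.append("un-activated")
        elif prev <= tau_act:
            states.append("semi-activated")
        else:
            states.append("fully-activated")
    return states

def tokens_this_step(remaining: int) -> int:
    """log2 schedule: decode at least half of the still-masked tokens."""
    return max(1, math.ceil(remaining / 2))
```

with `ratios = [0.9, 0.6, 0.2]`, block 2 is semi-activated (its predecessor is at 0.6, between $\tau_{\text{add}}$ and $\tau_{\text{act}}$) while blocks 0 and 1 are fully active, i.e. three blocks are in flight at once.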

training: asymmetric distillation. rather than training from scratch, Fast-dVLA distills from an existing finetuned bidirectional dVLA teacher using LoRA (rank 32). the distillation loss is:

\[\mathcal{L}\_{\text{AD}} = \mathbb{E}\left[\sum\_{i=1}^{N} D\_{\text{KL}}\left(p\_\theta(Y\_{B\_i}^0 \mid Y\_{B\_{<i}}^{t\_{<i}}, c) \,\|\, p\_{\phi^-}(Y\_{B\_i}^0 \mid Y\_{B\_{\leq N}}^{t\_{\leq N}}, c)\right)\right]\]

the teacher sees all blocks (bidirectional), the student only sees completed prior blocks (causal). during distillation, LoRA is disabled for teacher logits and enabled for student logits, preserving the pretrained backbone’s visual-language priors.
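a numpy sketch of the KL term in $\mathcal{L}_{\text{AD}}$ (the real loss operates on per-block logits from the causal student and the bidirectional teacher; shapes and names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_distill_loss(student_logits, teacher_logits):
    """KL(student || teacher), averaged over token positions.

    student_logits: causal student (LoRA enabled, conditions only on
    completed prior blocks). teacher_logits: frozen bidirectional
    teacher (LoRA disabled, sees all blocks). Shapes: (tokens, vocab).
    """
    p = softmax(student_logits)   # p_theta
    q = softmax(teacher_logits)   # p_{phi^-}
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)
    return kl.mean()
```

the asymmetry is in the conditioning, not the loss: both sides score the same block, but only the teacher gets the full (bidirectional) context.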

training

  • distillation steps: 4k for Dream-VLA (1/5 of original finetuning), 4k for DD-VLA (1/8 of original), 3k for UD-VLA (1/8 of original)
  • batch size: 12 for UD-VLA, follows original configs for others
  • block size: 7 (matching action dimensionality of 7 DoF) for Dream-VLA and DD-VLA; multiple of 32 for UD-VLA (which has 625-token sequences)
  • convergence: asymmetric distillation converges in ~2000 steps (5x faster than training from finetuned weights, 10x faster than from scratch)
  • LoRA rank: 32
  • hardware: not explicitly stated, but standard GPU training implied

evaluation

LIBERO benchmark (Dream-VLA base):

| method | spatial | goal | object | long | avg | speed |
|---|---|---|---|---|---|---|
| Dream-VLA baseline | 90.2 | 92.0 | 88.0 | 72.0 | 85.6 | 98.8 tok/s (1.0x) |
| + Fast-dLLM | 88.4 | 89.4 | 83.4 | 70.2 | 82.8 | 183.2 (1.9x) |
| + Block Diffusion | 91.8 | 90.4 | 88.6 | 72.2 | 85.8 | 181.7 (1.8x) |
| + Fast-dVLA (ours) | 91.2 | 92.0 | 90.2 | 74.6 | 87.0 | 313.1 (3.2x) |

LIBERO benchmark (DD-VLA base):

| method | spatial | goal | object | long | avg | speed |
|---|---|---|---|---|---|---|
| DD-VLA baseline | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 | 152.1 tok/s (1.0x) |
| + Fast-dLLM | 94.0 | 95.2 | 94.8 | 89.8 | 93.5 | 312.5 (3.2x) |
| + Block Diffusion | 97.6 | 98.6 | 97.2 | 93.2 | 96.7 | 322.1 (3.3x) |
| + Fast-dVLA (ours) | 97.0 | 98.8 | 97.6 | 92.8 | 96.6 | 402.7 (4.1x) |

CALVIN ABCD->D (UD-VLA base): 2.8x speedup (67.3 to 186.7 tok/s), avg length 4.54 vs 4.64 baseline. competitive with world-modeling VLAs like UP-VLA and MDT.

SimplerEnv (real-to-sim): highest decoding speed among all discrete-output VLAs. 59.3% avg task success, outperforming continuous flow-matching methods like $\pi_0$ (32.1%) and GR00T-N1 (36.5%).

real-world (bimanual AgileX, 6-DOF arms): consistent 30 Hz execution frequency across all tasks. nearly 2x efficiency of prior methods on conveyor picking. competitive success rates with faster completion times.

reproduction guide

  1. prerequisites: existing finetuned dVLA model (Dream-VLA, DD-VLA, or UD-VLA), PyTorch, GPU with sufficient memory
  2. setup: clone the project page repo (https://chris1220313648.github.io/Fast-dVLA/). configure LoRA rank to 32
  3. distillation: run asymmetric distillation for 4k steps (Dream-VLA/DD-VLA) or 3k steps (UD-VLA) with block size 7 (or 32+ for UD-VLA)
  4. inference: set $\tau_{\text{conf}} = 0.5$, $\tau_{\text{add}} = 0.5$, $\tau_{\text{act}} = 0.7$, logarithmic factor $\log_2$
  5. expected result: 2.8-4.1x speedup with maintained or improved success rates
  6. gotchas: block size should be a multiple of action dimensionality for best performance. confidence threshold $\tau_{\text{conf}}$ is the main speed/quality knob. the LoRA branches must be disabled for teacher, enabled for student during distillation
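the hyperparameters above, collected into one hypothetical config dict (key names are mine, not from any released code):

```python
# hypothetical config sketch for a Fast-dVLA distillation/inference run;
# values come from the paper, key names are illustrative
FAST_DVLA_CFG = {
    "lora_rank": 32,
    "block_size": 7,        # match the 7-DoF action dim (32+ for UD-VLA)
    "distill_steps": 4000,  # 3000 for UD-VLA
    "tau_conf": 0.5,        # confidence threshold: main speed/quality knob
    "tau_add": 0.5,         # prev-block ratio to semi-activate the next block
    "tau_act": 0.7,         # prev-block ratio to fully activate it
}
```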

notes

  • the key insight is that dVLAs already exhibit an implicit left-to-right decoding pattern despite bidirectional attention, making block-wise causal attention a natural fit rather than a disruptive change
  • asymmetric distillation is remarkably efficient: 2000 steps to convergence from a finetuned teacher, which is 10x cheaper than training from scratch
  • this method is complementary to other VLA efficiency techniques (quantization, pruning, early exit) and could stack with them
  • the 30 Hz real-time threshold is met consistently, which is the practical requirement for physical robot control
  • for bopi’s use case: if using dVLA for robot control, Fast-dVLA is essential for real-time performance. the distillation cost is low (4k steps) and the speedup is substantial (up to 4.1x)
  • connects to the VLA-vs-world-model question: Fast-dVLA makes dVLAs competitive with flow-matching VLAs in speed, so the choice between them can be based on representational quality rather than inference speed