2026-03-31
Realtime-VLA V2: Learning to Run VLAs Fast, Smooth, and Accurate
Wenhao Yu, Chen Wang, Ruohan Zhang, Linxi Fan, Yuke Zhu, Li Fei-Fei, Ankit Goyal, Sanja Fidler et al.
problem
VLA models produce high-quality actions in simulation but execute too slowly on real robots. the gap between neural network inference speed and physical robot control frequency (typically 3-30 Hz) makes direct deployment impractical. prior work like VLASH and $\pi_{0}$-Fast focused on GPU-side inference scheduling, but left the full robot deployment stack open. this paper tackles the orthogonal problem: given a VLA that already runs fast on GPU, how do you actually make the robot move fast in the real world?
architecture
this is a systems paper, not a model architecture paper. no new VLA model is proposed. instead it introduces a four-stage deployment pipeline that wraps around any VLA:
```mermaid
flowchart LR
    A[VLA policy] --> B[speed adaptation model]
    B --> C[temporal optimization]
    C --> D[spatial optimization]
    D --> E[robot execution]
    F[camera calibration] --> C
    G[robot dynamics] --> D
```
stage 1 – sub-5ms system delay calibration: sinusoidal sweep commands are sent to the robot while camera timestamps are recorded. this estimates the total loop delay (camera capture + image transfer + inference + command dispatch) to sub-5ms precision, using phase estimation on a 120fps camera to align the perception and control clocks.
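the phase-based idea can be sketched in a few lines (my reconstruction, not the paper's code): command a sinusoid, least-squares-fit sine/cosine components to the observed response, and read the loop delay off the recovered phase, which gives sub-sample precision. the 2 Hz test frequency and 12.3 ms synthetic delay below are arbitrary assumptions.

```python
import numpy as np

def estimate_delay(t, observed, freq):
    """Estimate loop delay of a sinusoidal sweep response via phase fitting.

    Fits observed[k] ~ a*sin(w*t[k]) + b*cos(w*t[k]) by least squares;
    for observed = sin(w*(t - d)) this gives a = cos(w*d), b = -sin(w*d),
    so the delay is atan2(-b, a) / w -- finer than one camera frame.
    """
    w = 2 * np.pi * freq
    basis = np.column_stack([np.sin(w * t), np.cos(w * t)])
    a, b = np.linalg.lstsq(basis, observed, rcond=None)[0]
    return np.arctan2(-b, a) / w

# synthetic check: 120 fps camera clock, 2 Hz sweep, 12.3 ms true delay
fs, freq, true_delay = 120.0, 2.0, 0.0123
t = np.arange(0, 2.0, 1.0 / fs)
observed = np.sin(2 * np.pi * freq * (t - true_delay))
est = estimate_delay(t, observed, freq)
```

note the estimate is ambiguous modulo one sweep period, so the test frequency must be slow enough that the true delay fits inside a single cycle.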
stage 2 – speed adaptation model: a lightweight regression model trained on human-in-the-loop demonstrations where a human operator specifies desired speed profiles. trained incrementally – each day of operation collects new data and retrains. maps context (task type, environment layout) to a speed throttle multiplier.
stage 3 – temporal optimization: formulates step-duration allocation as a quadratic program (QP) solved via OSQP. distributes acceleration evenly across trajectory segments to respect robot kinematic limits while minimizing total execution time. given $n$ waypoints with positions $q_i$ and durations $\tau_i$, minimizes $\sum \tau_i$ subject to velocity and acceleration constraints.
stage 4 – spatial optimization: acados MPC running in SQP-RTI (real-time iteration) mode. pre-amplifies VLA commands to compensate for ~150ms robot mechanical lag. enforces joint position, velocity, and acceleration limits as hard constraints. uses the robot’s rigid body dynamics model.
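a full acados setup is too involved to inline, but the pre-amplification idea can be illustrated in isolation with a first-order lag model (an assumption — the paper uses the full rigid-body dynamics inside an MPC): if the robot tracks commands with time constant T ≈ 150 ms, sending the lead-compensated command u = r + T·ṙ instead of the raw reference r cancels the lag.

```python
import numpy as np

T, dt = 0.15, 0.002                      # assumed 150 ms lag, 500 Hz control
t = np.arange(0, 4.0, dt)
r = np.sin(2 * np.pi * 1.0 * t)          # 1 Hz reference trajectory
r_dot = 2 * np.pi * 1.0 * np.cos(2 * np.pi * 1.0 * t)

def simulate(u):
    """First-order lag plant q_dot = (u - q)/T, Euler-integrated."""
    q = np.zeros_like(u)
    for k in range(len(u) - 1):
        q[k + 1] = q[k] + dt * (u[k] - q[k]) / T
    return q

q_naive = simulate(r)                    # send the reference directly
q_comp = simulate(r + T * r_dot)         # pre-amplified (lead) command

rms = lambda e: np.sqrt(np.mean(e ** 2))
err_naive, err_comp = rms(q_naive - r), rms(q_comp - r)
```

with u = r + T·ṙ the tracking-error dynamics become ė = −e/T, so the error decays exponentially instead of persisting as a phase lag; the MPC additionally enforces the joint limits, which naive pre-amplification cannot.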
roofline analysis: identifies motion-bounded segments (where VLA inference is faster than the robot can move) vs. control-bounded segments (where computation is the bottleneck). temporal optimization helps the former; spatial optimization helps the latter.
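the split can be phrased as a one-line check per trajectory segment (hypothetical numbers — the paper's exact criterion isn't captured in these notes):

```python
def classify_segment(t_infer, t_move):
    """Roofline-style label: which side is the bottleneck for this segment?

    t_infer: seconds the VLA needs to produce the next action chunk
    t_move:  seconds the robot physically needs to execute that chunk
    """
    return "motion-bounded" if t_move > t_infer else "control-bounded"

# hypothetical segments: (inference time, motion time) in seconds
segments = [(0.05, 0.40), (0.05, 0.03), (0.12, 0.90)]
labels = [classify_segment(ti, tm) for ti, tm in segments]
```

motion-bounded segments are where temporal optimization pays off (the robot is the bottleneck, so shorten the motion); control-bounded segments are where spatial optimization and faster inference matter.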
training
the speed adaptation model is the only trainable component:
- architecture: lightweight regression network (specifics not disclosed in paper)
- training: human-in-the-loop throttle data collected during daily operation
- update cycle: retrained daily as new demonstration data accumulates
- no VLA fine-tuning required – the pipeline is model-agnostic
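one plausible shape for the daily-retrained throttle regressor (pure assumption — the paper does not disclose the architecture): accumulate sufficient statistics (XᵀX, Xᵀy) as demonstrations arrive and re-solve a ridge regression each day, which makes the retrain essentially free.

```python
import numpy as np

class ThrottleRegressor:
    """Ridge regression from context features to a speed multiplier,
    retrained cheaply from accumulated sufficient statistics."""

    def __init__(self, n_features, reg=1e-3):
        self.XtX = reg * np.eye(n_features)   # ridge regularizer
        self.Xty = np.zeros(n_features)
        self.w = np.zeros(n_features)

    def add_day(self, X, y):
        """Fold in one day of human-in-the-loop throttle demonstrations."""
        self.XtX += X.T @ X
        self.Xty += X.T @ y

    def retrain(self):
        self.w = np.linalg.solve(self.XtX, self.Xty)

    def predict(self, x):
        return float(x @ self.w)

# synthetic check: true throttle = 0.5*task_flag + 0.2*clutter_level
rng = np.random.default_rng(0)
model = ThrottleRegressor(n_features=2)
for _ in range(3):                            # three "days" of data
    X = rng.uniform(0, 1, size=(50, 2))
    y = X @ np.array([0.5, 0.2])
    model.add_day(X, y)
    model.retrain()
pred = model.predict(np.array([1.0, 1.0]))
```

the feature names and linear form are illustrative only; any small regressor with an incremental update would fit the described daily-retrain loop.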
evaluation
real robot evaluation on table-top manipulation tasks with a Franka Emika Panda arm:
| task | human baseline (s) | unoptimized VLA (s) | VLA V2 pipeline (s) | speedup |
|---|---|---|---|---|
| fold shirt | 19.0 | 75.3 | 18.9 | 3.98x |
| place into fixture | 37.6 | 89.5 | 37.8 | 2.37x |
| pick and latch | 36.0 | 98.6 | 42.6 | 2.31x |
the pipeline brings VLA execution time down to human parity for fold shirt and place into fixture, and within ~18% of the human baseline for pick and latch. critical observation: the unoptimized VLA is 2-3x slower than humans because it stops between waypoints and moves at constant speed; the optimization pipeline eliminates these inefficiencies.
limitation: the paper reports execution time only, not task success rate or ablation studies. no comparison against a simple uniform-speedup baseline. the base VLA model is not specified.
reproduction guide
```shell
git clone https://github.com/dexmal/realtime-vla-v2.git
pip install -r requirements.txt   # acados, osqp, robot-specific drivers
```
requires a physical robot (Franka Panda or similar) with position control interface. the calibration procedure needs a 120fps+ camera and the ability to send arbitrary joint commands for sinusoidal sweeps.
gotchas:
- system delay calibration must be re-run whenever the software stack changes (new camera driver, different inference runtime)
- the speed adaptation model requires human demonstrations in the target environment – it does not transfer zero-shot
- acados MPC setup requires accurate URDF and inertial parameters for the specific robot
- no simulation support – everything is designed for real-world deployment
compute cost: the pipeline runs entirely on a single workstation with a consumer GPU. the MPC solver adds negligible overhead (~1ms per step).
notes
this paper is important because it exposes a blind spot in the VLA research community: everyone optimizes model architecture and inference scheduling, but nobody has published a systematic study of the full deployment stack. the four-stage pipeline is orthogonal to model-level optimizations from MMaDA-VLA (2603.25406), Fast-dVLA (2603.25661), and DFM-VLA (2603.26320) – all of those could benefit from this deployment stack.
the roofline analysis framework (motion-bounded vs. control-bounded) is a useful mental model for anyone deploying neural policies on real hardware. for bopi specifically, the calibration and MPC stages are directly applicable to getting a VLA running on the physical bopi body.