2026-03-30
$\pi_0$, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation
Johnathan Tucker, Jiankai Sun, Denis Liu, Brandon Kim, Aiden Swann, Lachlain McGranahan, Allen Ren, Quan Vuong, Javier Yu, Mac Schwager et al.
problem
VLA foundation models like $\pi_0$ are pretrained on fixed-base robot manipulators operating in quasi-static regimes. transferring these to aerial platforms (underactuated quadrotors with 6-DoF flight) is fundamentally hard: the dynamics gap means control commands that work for a robot arm cause a drone to crash. grasping changes the effective mass mid-flight, causing altitude sag. onboard cameras experience large ego-motion and motion blur absent in tabletop data. no prior work has systematically investigated whether manipulation-pretrained VLAs can transfer to aerial manipulators. AirVLA is the first such study.
architecture
AirVLA builds on $\pi_0$’s flow-matching architecture with two key additions:
payload-aware guidance in the flow-matching sampler. during inference, a gradient correction term is injected into the velocity field:
\[v_{\text{guid}}(x_\tau, o, \tau) = v_\theta(x_\tau, o, \tau) - s(\tau)\,\xi\]
where $\xi = \nabla_{x_\tau} \hat{A}_\theta(x_\tau, o, \tau)^\top \nabla_A \Phi(\hat{A}_\theta(x_\tau, o, \tau); o)$ is the vector-Jacobian product mapping the action-space gradient back to latent space.
the payload loss operates only on the altitude dimension:
\[\Phi_{\text{payload}}(A; o, A_{t-1}) = \frac{\lambda_z}{2} \sum_{t=0}^{H-1} \alpha(o, A_{t-1})\, w_t\, (z_t(A) - z_{\text{des}}(o))^2\]
where $z_{\text{des}} = z_{\text{curr}} + \Delta z$ biases the drone slightly higher under load ($\Delta z > 0$). the payload confidence $\alpha(o, A_{t-1}) \in [0, 1]$ is computed from recent gripper commands and measured aperture, smoothly gating the guidance: $\alpha \approx 0$ during free flight (no correction), $\alpha \approx 1$ when carrying a load. this composes additively with RTC's continuity objective.
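the guided step is easy to sketch with autograd, since the VJP $\xi$ is exactly what backprop through the action head gives you. a minimal PyTorch sketch under assumptions not in the notes: `policy.velocity` and `policy.decode` are hypothetical handles for $v_\theta$ and $\hat{A}_\theta$, and altitude is action index 2.

```python
import torch

def payload_loss(A, z_des, alpha, w, lambda_z=1.0):
    """Phi_payload: weighted squared altitude error over the chunk,
    gated by payload confidence alpha in [0, 1]."""
    z = A[:, 2]  # altitude dimension of each action in the chunk
    return 0.5 * lambda_z * alpha * torch.sum(w * (z - z_des) ** 2)

def guided_velocity(policy, x_tau, o, tau, z_des, alpha, w, s_tau):
    """v_guid = v_theta - s(tau) * xi, where xi is the VJP of Phi
    through the action head A_hat(x_tau, o, tau)."""
    x = x_tau.detach().requires_grad_(True)
    A_hat = policy.decode(x, o, tau)          # predicted action chunk
    phi = payload_loss(A_hat, z_des, alpha, w)
    (xi,) = torch.autograd.grad(phi, x)       # xi = d Phi / d x_tau
    return policy.velocity(x_tau, o, tau) - s_tau * xi
```

with $\alpha = 0$ the gradient vanishes and the sampler reduces to plain $\pi_0$; the correction only activates when the gate believes a payload is held.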
Gaussian splatting synthetic data pipeline. photorealistic 3DGS reconstructions of the environment are coupled with a semi-kinematic drone dynamics model (position, velocity, orientation quaternion; normalized thrust + angular velocity controls). the gripper is treated as a separate foreground layer segmented by SAM, composited onto clean scene renders to avoid observation bias. domain-randomized trajectories perturb initial states and waypoints to induce recovery behaviors near obstacles.
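the semi-kinematic dynamics model is the part of the pipeline that turns 3DGS renders into trajectories. a sketch under illustrative assumptions (explicit Euler integration, Hamilton quaternion convention, state $(p, v, q)$ with normalized thrust + body-rate controls; none of these details are pinned down in the notes):

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity, world frame

def quat_mul(q, r):
    """Hamilton product, q = (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def body_z(q):
    """world-frame thrust axis: rotate e_z by q."""
    w, x, y, z = q
    return np.array([2*(x*z + w*y), 2*(y*z - w*x), 1 - 2*(x*x + y*y)])

def step(p, v, q, thrust, omega, dt):
    """one Euler step: normalized thrust along body z plus gravity,
    quaternion kinematics q_dot = 0.5 * q ⊗ (0, omega)."""
    a = thrust * body_z(q) + G
    p = p + v * dt
    v = v + a * dt
    q = q + 0.5 * quat_mul(q, np.concatenate([[0.0], omega])) * dt
    return p, v, q / np.linalg.norm(q)
```

rolling this out from perturbed initial states and waypoints gives the domain-randomized recovery trajectories, each rendered from the reconstructed splat at the simulated camera pose.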
hardware: ModalAI Starling 2 Max quadrotor with VOXL 2 (Qualcomm QRB5165), custom 3D-printed UMI-style gripper, 3 cameras (forward, downward, external). RGB at 256x256, 5 Hz. actions are 4-DoF delta poses + gripper at 10 Hz via PX4.
training
- fine-tunes $\pi_0$ on 270 teleoperated demos (pick-and-place + navigation) + 50 synthetic navigation trajectories from 3DGS pipeline
- synthetic data generated from short walk-throughs reconstructed with Nerfstudio, with domain-randomized waypoints and recovery trajectories
- action chunk size: standard $\pi_0$ chunking with Real-Time Chunking (RTC) for async inference
- fine-tuning details not fully specified (follows $\pi_0$ pipeline)
- evaluation: 460 total real-world flight trials
evaluation
single-task pick-and-place (20 trials each):
| method | pick | place |
|---|---|---|
| $\pi_0$ naive | 0.0% | - |
| $\pi_0$ + RTC | 23.5% | 80.0% |
| $\pi_0$ + RTC + payload guidance (ours) | 50.0% | 100.0% |
| ACT | 0.0% | - |
| Diffusion Policy | 0.0% | - |
navigation with synthetic data (20 trials each):
| method | gate | hover |
|---|---|---|
| $\pi_0$ naive (non-synthetic) | 50.0% | 45.0% |
| $\pi_0$ + RTC (non-synthetic) | 80.0% | 95.0% |
| $\pi_0$ + RTC (synthetic) | 100.0% | 100.0% |
compositional navigate-then-grasp (with synthetic, 20 trials):
| method | gate | hover | pick | place |
|---|---|---|---|---|
| $\pi_0$ + RTC + payload guidance (ours) | 85.0% | 94.7% | 83.3% | 62.5% |
OOD robustness: 70% pick success on novel sandwich, 10% on chips bag. 40% gate success in right region, 0% in front/left.
reproduction guide
- hardware: ModalAI Starling 2 Max, custom UMI gripper (3D printed, hobby servos), 3 cameras, motion capture system
- data collection: 270 teleoperated aerial demos + 50 synthetic trajectories from 3DGS pipeline
- 3DGS setup: capture short walk-throughs, reconstruct with Nerfstudio, segment gripper with SAM, synthesize domain-randomized trajectories via drone dynamics model
- fine-tuning: fine-tune $\pi_0$ on combined dataset following standard $\pi_0$ pipeline
- deployment: enable RTC + payload-aware guidance at inference time. set $\lambda_z$ and $\Delta z$ based on expected payload mass
- gotchas: motion capture dependency is a major limitation (future work: VIO/SLAM). synthetic data is critical for navigation (20% boost). payload guidance helps most on the place stage. the method is brittle to novel gate positions and object geometries with only 270 demos
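the one deployment piece the notes leave underspecified is the payload-confidence gate $\alpha(o, A_{t-1})$. a hedged sketch of one plausible implementation: the notes only say it uses recent gripper close commands and measured aperture, so the stall heuristic, sigmoid sharpness, and low-pass smoothing below are all illustrative assumptions.

```python
import math

def payload_confidence(close_cmd, aperture, open_aperture, alpha_prev,
                       beta=0.8, k=10.0, thresh=0.5):
    """alpha near 1 when a close was commanded but the fingers stalled
    on an object (aperture still partly open); low-pass filtered so the
    guidance gates on and off smoothly rather than flickering."""
    stall = close_cmd * (aperture / open_aperture)   # in [0, 1]
    raw = 1.0 / (1.0 + math.exp(-k * (stall - thresh)))
    return beta * alpha_prev + (1.0 - beta) * raw
```

the smoothing matters: a hard on/off gate would inject step discontinuities into the guided velocity field, which is exactly what RTC's continuity objective is trying to avoid.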
notes
- the visual/semantic representations from $\pi_0$ transfer well to aerial viewpoints, but the control dynamics do not. this is a clean separation of what transfers and what doesn’t in cross-embodiment settings
- payload-aware guidance is elegant: it’s a physics-informed injection into the generative sampling process, not a separate controller. the payload confidence gating makes it seamlessly activate/deactivate
- 3DGS synthetic data is the biggest single contributor to navigation performance (50% -> 100% gate success). this is a practical technique for data-scarce aerial robotics
- the 270 demo dataset is very small by VLA standards, and it shows in the OOD results. for bopi’s purposes, this highlights that VLA transfer to new embodiments still requires significant embodiment-specific data
- connects to the real-time chunking (RTC) line of work: inference-time guidance is becoming a standard pattern for bridging the gap between generalist policies and physical constraints