2026-03-30
$\pi_0$, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation
Johnathan Tucker, Jiankai Sun, Denis Liu, Brandon Kim, Aiden Swann, Lachlain McGranahan, Allen Ren, Quan Vuong, Javier Yu, Mac Schwager et al.
problem
VLA foundation models like $\pi_0$ are pretrained on fixed-base robot manipulators operating in quasi-static regimes. transferring these to aerial platforms (underactuated quadrotors with 6-DoF flight) is fundamentally hard: the dynamics gap means control commands that work for a robot arm cause a drone to crash. grasping changes the effective mass mid-flight, causing altitude sag. onboard cameras experience large ego-motion and motion blur absent in tabletop data. no prior work has systematically investigated whether manipulation-pretrained VLAs can transfer to aerial manipulators. AirVLA is the first such study.
architecture
AirVLA builds on $\pi_0$’s flow-matching architecture with two key additions:
payload-aware guidance in the flow-matching sampler. during inference, a gradient correction term is injected into the velocity field:
\[v_{\text{guid}}(x_\tau, o, \tau) = v_\theta(x_\tau, o, \tau) - s(\tau)\,\xi\]
where $\xi = \nabla_{x_\tau} \hat{A}_\theta(x_\tau, o, \tau)^\top \nabla_A \Phi(\hat{A}_\theta(x_\tau, o, \tau); o)$ is the vector-Jacobian product mapping the action-space gradient back to latent space.
the payload loss operates only on the altitude dimension:
\[\Phi_{\text{payload}}(A; o, A_{t-1}) = \frac{\lambda_z}{2} \sum_{t=0}^{H-1} \alpha(o, A_{t-1})\, w_t\, (z_t(A) - z_{\text{des}}(o))^2\]
where $z_{\text{des}} = z_{\text{curr}} + \Delta z$ biases the drone slightly higher under load ($\Delta z > 0$). the payload confidence $\alpha(o, A_{t-1}) \in [0, 1]$ is computed from recent gripper commands and measured aperture, smoothly gating the guidance: $\alpha \approx 0$ during free flight (no correction), $\alpha \approx 1$ when carrying a load. this composes additively with RTC's continuity objective.
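the guided step is easy to sketch with autograd, since the VJP $\xi$ is exactly what backprop through the action head gives you. a minimal PyTorch sketch under assumptions not in the notes: `policy.velocity` and `policy.decode` are hypothetical handles for $v_\theta$ and $\hat{A}_\theta$, and altitude is action index 2.

```python
import torch

def payload_loss(A, z_des, alpha, w, lambda_z=1.0):
    """Phi_payload: weighted squared altitude error over the chunk,
    gated by payload confidence alpha in [0, 1]."""
    z = A[:, 2]  # altitude dimension of each action in the chunk
    return 0.5 * lambda_z * alpha * torch.sum(w * (z - z_des) ** 2)

def guided_velocity(policy, x_tau, o, tau, z_des, alpha, w, s_tau):
    """v_guid = v_theta - s(tau) * xi, where xi is the VJP of Phi
    through the action head A_hat(x_tau, o, tau)."""
    x = x_tau.detach().requires_grad_(True)
    A_hat = policy.decode(x, o, tau)          # predicted action chunk
    phi = payload_loss(A_hat, z_des, alpha, w)
    (xi,) = torch.autograd.grad(phi, x)       # xi = d Phi / d x_tau
    return policy.velocity(x_tau, o, tau) - s_tau * xi
```

with $\alpha = 0$ the gradient vanishes and the sampler reduces to plain $\pi_0$; the correction only activates when the gate believes a payload is held.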
Gaussian splatting synthetic data pipeline. photorealistic 3DGS reconstructions of the environment are coupled with a semi-kinematic drone dynamics model (position, velocity, orientation quaternion; normalized thrust + angular velocity controls). the gripper is treated as a separate foreground layer segmented by SAM, composited onto clean scene renders to avoid observation bias. domain-randomized trajectories perturb initial states and waypoints to induce recovery behaviors near obstacles.
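the semi-kinematic dynamics model is the part of the pipeline that turns 3DGS renders into trajectories. a sketch under illustrative assumptions (explicit Euler integration, Hamilton quaternion convention, state $(p, v, q)$ with normalized thrust + body-rate controls; none of these details are pinned down in the notes):

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity, world frame

def quat_mul(q, r):
    """Hamilton product, q = (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def body_z(q):
    """world-frame thrust axis: rotate e_z by q."""
    w, x, y, z = q
    return np.array([2*(x*z + w*y), 2*(y*z - w*x), 1 - 2*(x*x + y*y)])

def step(p, v, q, thrust, omega, dt):
    """one Euler step: normalized thrust along body z plus gravity,
    quaternion kinematics q_dot = 0.5 * q ⊗ (0, omega)."""
    a = thrust * body_z(q) + G
    p = p + v * dt
    v = v + a * dt
    q = q + 0.5 * quat_mul(q, np.concatenate([[0.0], omega])) * dt
    return p, v, q / np.linalg.norm(q)
```

rolling this out from perturbed initial states and waypoints gives the domain-randomized recovery trajectories, each rendered from the reconstructed splat at the simulated camera pose.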
hardware: ModalAI Starling 2 Max quadrotor with VOXL 2 (Qualcomm QRB5165), custom 3D-printed UMI-style gripper, 3 cameras (forward, downward, external). RGB at 256x256, 5 Hz. actions are 4-DoF delta poses + gripper at 10 Hz via PX4.
training
- fine-tunes $\pi_0$ on 270 teleoperated demos (pick-and-place + navigation) + 50 synthetic navigation trajectories from 3DGS pipeline
- synthetic data generated from short walk-throughs reconstructed with Nerfstudio, with domain-randomized waypoints and recovery trajectories
- action chunk size: standard $\pi_0$ chunking with Real-Time Chunking (RTC) for async inference
- fine-tuning details not fully specified (follows $\pi_0$ pipeline)
- evaluation: 460 total real-world flight trials
evaluation
single-task pick-and-place (20 trials each):
| method | pick | place |
|---|---|---|
| $\pi_0$ naive | 0.0% | - |
| $\pi_0$ + RTC | 23.5% | 80.0% |
| $\pi_0$ + RTC + payload guidance (ours) | 50.0% | 100.0% |
| ACT | 0.0% | - |
| Diffusion Policy | 0.0% | - |
navigation with synthetic data (20 trials each):
| method | gate | hover |
|---|---|---|
| $\pi_0$ naive (non-synthetic) | 50.0% | 45.0% |
| $\pi_0$ + RTC (non-synthetic) | 80.0% | 95.0% |
| $\pi_0$ + RTC (synthetic) | 100.0% | 100.0% |
compositional navigate-then-grasp (with synthetic, 20 trials):
| method | gate | hover | pick | place |
|---|---|---|---|---|
| $\pi_0$ + RTC + payload guidance (ours) | 85.0% | 94.7% | 83.3% | 62.5% |
OOD robustness: 70% pick success on novel sandwich, 10% on chips bag. 40% gate success in right region, 0% in front/left.
reproduction guide
- hardware: ModalAI Starling 2 Max, custom UMI gripper (3D printed, hobby servos), 3 cameras, motion capture system
- data collection: 270 teleoperated aerial demos + 50 synthetic trajectories from 3DGS pipeline
- 3DGS setup: capture short walk-throughs, reconstruct with Nerfstudio, segment gripper with SAM, synthesize domain-randomized trajectories via drone dynamics model
- fine-tuning: fine-tune $\pi_0$ on combined dataset following standard $\pi_0$ pipeline
- deployment: enable RTC + payload-aware guidance at inference time. set $\lambda_z$ and $\Delta z$ based on expected payload mass
- gotchas: motion capture dependency is a major limitation (future work: VIO/SLAM). synthetic data is critical for navigation (20% boost). payload guidance helps most on the place stage. the method is brittle to novel gate positions and object geometries with only 270 demos
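the one deployment piece the notes leave underspecified is the payload-confidence gate $\alpha(o, A_{t-1})$. a hedged sketch of one plausible implementation: the notes only say it uses recent gripper close commands and measured aperture, so the stall heuristic, sigmoid sharpness, and low-pass smoothing below are all illustrative assumptions.

```python
import math

def payload_confidence(close_cmd, aperture, open_aperture, alpha_prev,
                       beta=0.8, k=10.0, thresh=0.5):
    """alpha near 1 when a close was commanded but the fingers stalled
    on an object (aperture still partly open); low-pass filtered so the
    guidance gates on and off smoothly rather than flickering."""
    stall = close_cmd * (aperture / open_aperture)   # in [0, 1]
    raw = 1.0 / (1.0 + math.exp(-k * (stall - thresh)))
    return beta * alpha_prev + (1.0 - beta) * raw
```

the smoothing matters: a hard on/off gate would inject step discontinuities into the guided velocity field, which is exactly what RTC's continuity objective is trying to avoid.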
notes
- the visual/semantic representations from $\pi_0$ transfer well to aerial viewpoints, but the control dynamics do not. this is a clean separation of what transfers and what doesn’t in cross-embodiment settings
- payload-aware guidance is elegant: it’s a physics-informed injection into the generative sampling process, not a separate controller. the payload confidence gating makes it seamlessly activate/deactivate
- 3DGS synthetic data is the biggest single contributor to navigation performance (50% -> 100% gate success). this is a practical technique for data-scarce aerial robotics
- the 270 demo dataset is very small by VLA standards, and it shows in the OOD results. for bopi’s purposes, this highlights that VLA transfer to new embodiments still requires significant embodiment-specific data
- connects to the real-time chunking (RTC) line of work: inference-time guidance is becoming a standard pattern for bridging the gap between generalist policies and physical constraints