2026-04-05
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
Haochen Niu, Kanyu Zhang, Shuyu Yin, Qinghai Guo, Peilin Liu, Fei Wen et al.
problem
VLA finetuning inherits language-style training objectives (one-hot cross-entropy or its RL analogues such as PPO, GRPO) that assume a single correct token at each step. This ignores a fundamental property of physical manipulation: feasible action neighborhoods (FAN) — for any state $s$, there exists a connected set of neighboring actions around $a^*(s)$ that yield near-identical task progress. Formalized as:
\[\mathbb{N}\_\delta(s) \subseteq \left\{a \in A : Q(s, a^*(s)) - Q(s, a) \leq \delta \right\}\]
Two concrete failure modes result from ignoring FAN:
- SFT overfitting: with small task-specific datasets (e.g., 16K trajectories), the policy collapses probability mass onto a single demonstrated action bin, producing “spiky” distributions and poor OOD generalization. OpenVLA + SFT achieves only 78.1% in-distribution success on ManiSkill vs. 89.8% with the proposed method.
- RFT sample inefficiency: PPO/GRPO must implicitly discover action tolerance through exploration, requiring ~3× more training steps to reach 90% success rate compared to the proposed FAN-PPO (249 vs. 98 steps for OpenVLA on ManiSkill).
Prior regularization approaches are inadequate: label smoothing (ε = 0.05) provides only modest gains (+4.7% IND, +2.0% avg OOD on ManiSkill) and degrades at higher ε; entropy maximization is unstructured and promotes exploration rather than modeling the local geometry of action tolerance.
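To make the FAN definition concrete, here is a toy sketch (the quadratic Q-function and the 1-D discretization are illustrative assumptions, not from the paper) that computes the δ-feasible neighborhood over a discretized action space:

```python
import numpy as np

# Hypothetical Q(s, a) over a 1-D discretized action space: actions near the
# optimum a* = 0.3 yield near-identical task progress.
actions = np.linspace(-1.0, 1.0, 256)     # discretized action bins
q_values = 1.0 - (actions - 0.3) ** 2     # toy Q-function, peaked at a* = 0.3

a_star = actions[np.argmax(q_values)]     # a*(s) = argmax_a Q(s, a)
delta = 0.05                              # tolerated Q-value drop

# Feasible action neighborhood: all a with Q(s, a*) - Q(s, a) <= delta
fan_mask = q_values.max() - q_values <= delta
fan = actions[fan_mask]

# The FAN is a contiguous interval of bins around a*, not a single bin --
# exactly the structure that one-hot cross-entropy ignores.
print(f"a* = {a_star:.3f}, FAN spans [{fan.min():.3f}, {fan.max():.3f}]")
```

A one-hot target places all probability on the single bin at `a_star`, while every bin inside `fan` would make near-identical task progress.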
architecture
FAN introduces no architectural changes — it is a pure loss-level regularizer compatible with any autoregressive VLA backbone. Experiments use two VLA models:
- OpenVLA [Kim et al., 2024]: SigLIP + DINOv2 dual visual encoders fused with Llama2-7B, autoregressively predicting 7-DoF action tokens over discretized bins. Outputs a single action per step.
- OpenVLA-OFT [Kim et al., 2025]: Extended variant that outputs action chunks (8 steps open-loop), accepts third-person + wrist camera images and robot proprioceptive state.
Both are finetuned with LoRA (rank 32). All experiments on NVIDIA A100 80GB GPUs.
training
The core contribution is the FAN-guided regularizer, defined as the KL divergence between the policy $\pi_\theta$ and a target Gaussian $\mathcal{N}(\mu(s), \Sigma(s))$:
\[\mathcal{L}\_{\text{FAN}} = \mathbb{E}\_s\left[D\_{\text{KL}}\!\left(\pi(\cdot \mid s) \,\|\, \mathcal{N}(\cdot \mid \mu(s), \Sigma(s))\right)\right]\]
where $\mu(s) = \arg\max_a \pi(a \mid s)$ is the policy’s own mode. The implementation differs between SFT and RFT:
FAN-SFT — adaptive covariance: \(\mathcal{L}\_{\text{FAN-SFT}}(\theta) = -\frac{1}{n}\sum\_{i=1}^{n}\sum\_{t=0}^{K^i-1}\left(\log \pi\_\theta(a\_t^i \mid s\_t^i, l^i) + \alpha\, D\_{\text{KL}}\!\left(\pi\_\theta(\cdot \mid s\_t^i, l^i) \| \mathcal{N}(\cdot \mid \mu, \Sigma)\right)\right)\)
The covariance is dynamically set to the policy’s own variance: $\Sigma(s) = \text{diag}\!\left(\sum_{a \in A} \pi(a \mid s, l)(a - \mu(s))^2\right)$.
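Since no code is released, the FAN-SFT loss can be sketched in PyTorch as follows (the function name, tensor shapes, and per-dimension bin discretization are assumptions; the KL is taken between the categorical policy over action bins and a Gaussian renormalized over those bins, with the adaptive variance above):

```python
import torch
import torch.nn.functional as F

def fan_sft_loss(logits, target_ids, bin_centers, alpha=0.05):
    """NLL + alpha * KL(pi || N(mu, Sigma)) over discretized action bins.

    logits:      (B, n_bins) per-dimension action logits from the VLA head
    target_ids:  (B,) demonstrated action bin indices
    bin_centers: (n_bins,) continuous value of each action bin
    """
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()

    # Target built from the policy's own statistics; detaching it (a natural
    # implementation choice, assumed here) keeps it a fixed anchor per step.
    pi_d = pi.detach()
    mu = bin_centers[pi_d.argmax(dim=-1)]                        # mode mu(s), (B,)
    var = (pi_d * (bin_centers - mu.unsqueeze(-1)) ** 2).sum(-1) # adaptive Sigma(s)
    var = var.clamp_min(1e-6)

    # Discretized Gaussian target N(a | mu, var), renormalized over the bins
    log_target = -0.5 * (bin_centers - mu.unsqueeze(-1)) ** 2 / var.unsqueeze(-1)
    log_target = F.log_softmax(log_target, dim=-1)

    nll = F.nll_loss(log_pi, target_ids)                         # standard SFT term
    kl = (pi * (log_pi - log_target)).sum(-1).mean()             # KL(pi || N)
    return nll + alpha * kl
```

The KL term is non-negative, so the regularizer only adds a penalty on top of the usual NLL; with α = 0 it reduces to vanilla SFT.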
FAN-PPO — fixed covariance $\Sigma = \sigma^2 I$ for training stability: \(\mathcal{L}\_{\text{FAN-PPO}}(\theta) = -\frac{1}{K}\sum\_{k=0}^{K-1}\left[\min\!\left(\hat{I}\_k \hat{A}\_k,\, \text{Clip}(\hat{I}\_k, 1{-}\epsilon, 1{+}\epsilon)\hat{A}\_k\right) - \alpha\, D\_{\text{KL}}\!\left(\pi\_\theta(\cdot \mid s\_k, l) \,\|\, \mathcal{N}(\cdot \mid \mu(s\_k), \Sigma)\right)\right]\)
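A minimal PyTorch sketch of the FAN-PPO objective (names, shapes, and the per-dimension discretization are assumptions; value loss and entropy handling omitted) would be:

```python
import torch
import torch.nn.functional as F

def fan_ppo_loss(logits, old_log_probs, actions, advantages, bin_centers,
                 alpha=1.0, sigma=0.3, clip_eps=0.2):
    """Clipped PPO surrogate minus alpha * KL(pi || N(mu, sigma^2 I)).

    logits:        (B, n_bins) current policy logits over action bins
    old_log_probs: (B,) log pi_old(a|s) for the sampled actions
    actions:       (B,) sampled action bin indices
    advantages:    (B,) GAE advantage estimates
    """
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()

    # Standard clipped PPO surrogate
    new_lp = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = (new_lp - old_log_probs).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages)

    # FAN target: fixed-covariance Gaussian centered on the policy mode,
    # renormalized over the action bins
    mu = bin_centers[pi.argmax(dim=-1)]
    log_target = -0.5 * ((bin_centers - mu.unsqueeze(-1)) / sigma) ** 2
    log_target = F.log_softmax(log_target, dim=-1)
    kl = (pi * (log_pi - log_target)).sum(-1)                    # KL(pi || N)

    return -(surrogate - alpha * kl).mean()
```

The only change relative to vanilla PPO is the subtracted KL term; with α = 0 the loss reduces to the usual clipped surrogate.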
The optimal policy has a closed form (Proposition 1): $\pi_{t+1}(a \mid s, l) \propto \mathcal{N}(a \mid \mu, \Sigma)^{\frac{\alpha}{\alpha+\beta}} \, \pi_t(a \mid s, l)^{\frac{\beta}{\alpha+\beta}} \exp\!\left(\frac{Q^{\pi_t}(s,a,l)}{\alpha+\beta}\right)$, revealing a geometric interpolation between the target Gaussian and the previous policy, re-weighted by Q-values.
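This closed form follows from the standard KL-regularized policy-improvement argument; a sketch, assuming $\beta$ is the coefficient on a trust-region term $D_{\text{KL}}(\pi \,\|\, \pi_t)$ (not stated explicitly in these notes):

```latex
% Regularized policy-improvement objective:
\pi_{t+1} = \arg\max_{\pi} \;
    \mathbb{E}_{a \sim \pi}\big[ Q^{\pi_t}(s, a, l) \big]
    - \alpha\, D_{\mathrm{KL}}\big( \pi \,\|\, \mathcal{N}(\cdot \mid \mu, \Sigma) \big)
    - \beta\, D_{\mathrm{KL}}\big( \pi \,\|\, \pi_t \big)

% Stationarity (with a Lagrange multiplier for \sum_a \pi(a) = 1) gives
\log \pi_{t+1}(a \mid s, l)
  = \frac{\alpha \log \mathcal{N}(a \mid \mu, \Sigma)
        + \beta \log \pi_t(a \mid s, l)
        + Q^{\pi_t}(s, a, l)}{\alpha + \beta} + \text{const}
```

Exponentiating the last line yields exactly the geometric interpolation above, with exponents $\frac{\alpha}{\alpha+\beta}$ and $\frac{\beta}{\alpha+\beta}$ summing to one.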
Key hyperparameters:
| Setting | Model | α | σ | Notes |
|---|---|---|---|---|
| SFT ManiSkill | OpenVLA | 0.05 | adaptive | 4×A100, LR 5e-4, batch 40, 60K steps, 16K demos |
| SFT LIBERO | OpenVLA | 0.01 | adaptive | 2×A100, LR 5e-4, batch 48 |
| SFT LIBERO | OpenVLA-OFT | 0.05 | adaptive | 4×A100, LR 5e-4, batch 32, chunk size 8 |
| RFT ManiSkill | OpenVLA | 1.0 | 0.3 | 1×A100, 390 episodes, 64 traj/ep, max 80 steps |
| RFT ManiSkill | OpenVLA-OFT | 0.1 | 0.2 | 1×A100, 650 episodes, 96 traj/ep, max 80 steps |
Sensitivity: α = 0.01–0.1 works well for SFT; α > 2.0 destabilizes RFT. σ ∈ [0.1, 2.0] yields similar RFT performance; σ < 0.05 causes collapse.
evaluation
ManiSkill — SFT (PutOnPlateInScene25Main-v3, 25 pick-and-place tasks, 15 OOD variants):
| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + SFT | 78.1 | 76.6 | 57.4 | 40.4 | 58.1 |
| OpenVLA + FAN-SFT | 89.8 | 81.7 | 63.5 | 44.8 | 63.3 |
| Δ | +11.7 | +5.1 | +6.1 | +4.4 | +5.2 |
Largest single-task gains: M-Obj. (OOD) +9.3%, Disturb Recep. +7.8%, Noise-s +7.2%.
ManiSkill — RFT:
| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + PPO | 95.9 | 80.1 | 79.7 | 85.8 | 81.9 |
| OpenVLA + FAN-PPO | 97.4 | 85.0 | 86.7 | 92.6 | 88.1 |
| OpenVLA-OFT + PPO | 92.3 | 84.9 | 49.0 | 55.9 | 63.3 |
| OpenVLA-OFT + FAN-PPO | 97.3 | 88.1 | 58.6 | 67.0 | 71.2 |
Sample efficiency: FAN-PPO reaches 90% rollout success in 98 steps vs. 249 for vanilla PPO on OpenVLA (~2.5× faster). For evaluation, reaches 70% in 109 steps vs. 279 (~2.6× faster).
LIBERO — SFT (4 suites):
| Method | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA + FAN-SFT | 87.2 | — | — | — | — |
| OpenVLA-OFT | 95.2 | 94.2 | 95.2 | 93.2 | 94.5 |
| OpenVLA-OFT + FAN-SFT | 98.8 | 96.6 | 97.0 | 95.2 | 96.9 |
FAN-SFT on OpenVLA-OFT surpasses UniVLA (95.2%) by +1.7% overall.
Real-World (JAKA 7-DoF + RealSense D455, 150 demos, 30 trials/task):
| Method | Task-1 (IND) | Task-2 (obj pose) | Task-3 (robot pose) | Task-4 (box pos) |
|---|---|---|---|---|
| OpenVLA + SFT | 19/30 | 7/30 | 7/30 | 1/30 |
| OpenVLA + FAN-SFT | 22/30 | 12/30 | 17/30 | 7/30 |
reproduction guide
- Environment setup: Install ManiSkill3 (GPU-parallelized simulation) and LIBERO benchmark. All experiments on NVIDIA A100 80GB.
- Base checkpoints:
  - OpenVLA (SFT warmup): huggingface.co/gen-robot/openvla-7b-rlvla-warmup
  - OpenVLA (original): huggingface.co/openvla/openvla-7b
  - OpenVLA-OFT: huggingface.co/RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora
- FAN-SFT on ManiSkill:
- Collect 16K demonstrations via ManiSkill motion planner on PutOnPlateInScene25Main-v3
- 4×A100, LR 5e-4, batch 40, LoRA rank 32, input 224×224 px
  - Train 60K steps, add $\alpha\, D_{\text{KL}}(\pi \,\|\, \mathcal{N}(\mu, \Sigma))$ to the NLL loss with α = 0.05
- Adaptive covariance: $\Sigma = \text{diag}(\text{Var}_a[a])$ computed from policy distribution
- FAN-PPO on ManiSkill:
- Start from SFT-warmup checkpoint
- 1×A100, policy LR 1e-4, value LR 3e-3, mini-batch 8 (OpenVLA) or 12 (OpenVLA-OFT)
- PPO: GAE λ = 0.95, clip ε = 0.2, entropy coeff = 0.0, 1 training epoch per episode
- OpenVLA: α = 1.0, σ = 0.3, 390 episodes × 64 trajectories, max 80 steps/trajectory
- OpenVLA-OFT: α = 0.1, σ = 0.2, 650 episodes × 96 trajectories, max 80 steps/trajectory
- Fixed covariance $\Sigma = \sigma^2 I$
- Evaluation: 15 OOD variants (5 vision, 8 semantic, 3 execution). Report success rates averaged over multiple seeds.
- Tip: Start with the FAN-SFT implementation — it requires no environment interaction and is simpler to debug. The regularizer is just a KL divergence term added to the existing loss, computed between the policy logits and a Gaussian centered on the argmax action.
notes
- The method is fundamentally a structured prior on the action distribution geometry, not entropy regularization. The Gaussian target encodes unimodality + smoothness + local contiguity — properties of physical FANs.
- The distinction between adaptive covariance (SFT) and fixed covariance (RFT) is a practical stability choice, not a theoretical requirement. SFT’s supervised signal stabilizes the adaptive target; RFT needs the anchor of a fixed shape.
- Compared to label smoothing: FAN provides structured geometry-aware regularization. Label smoothing at best ε = 0.05 yields 82.8% IND vs. FAN’s 89.8% on ManiSkill SFT.
- Compared to entropy maximization: EM is unstructured and less sample-efficient. FAN-PPO converges faster and is less sensitive to hyperparameter choice.
- A Gaussian-kernel-smoothed target (multi-modal) also improves over the baseline but underperforms the unimodal Gaussian, suggesting the specific regularizer design matters.
- The FAN concept parallels observations in motor control literature — human movements exhibit invariant features within equivalence classes. The paper draws an explicit connection in the discussion.
- No code is released yet. The implementation is straightforward (~10 lines of PyTorch per loss modification) given an existing PPO/SFT training loop for VLAs.
- Applicable beyond OpenVLA: the regularizer is model-agnostic and could benefit other autoregressive VLAs (π0 with continuous actions would need adaptation), diffusion-based policies (π0, π0.5), or VQ-VLA.
- Limitations: fixed σ in RFT requires tuning per model; the Gaussian unimodal assumption may not hold for all tasks (e.g., multi-path manipulation); experiments are limited to pick-and-place primitives.