2026-04-05

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Haochen Niu, Kanyu Zhang, Shuyu Yin, Qinghai Guo, Peilin Liu, Fei Wen et al.

VLA action-prior RL-finetuning

problem

VLA finetuning inherits language-style training objectives (one-hot cross-entropy, or RL analogues such as PPO and GRPO) that assume a single correct token at each step. This ignores a fundamental property of physical manipulation: feasible action neighborhoods (FAN) — for any state $s$, there exists a connected set of neighboring actions around $a^*(s)$ that yield near-identical task progress. This is formalized as:

\[\mathbb{N}_\delta(s) \subseteq \left\{a \in A : Q(s, a^*(s)) - Q(s, a) \leq \delta \right\}\]
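
As a toy illustration of this definition (entirely our construction, with a hypothetical 1-D Q-function; none of these names come from the paper), the set of actions within $\delta$ of the optimum forms one contiguous interval around $a^*$:

```python
# Toy illustration of a feasible action neighborhood (not from the paper's code):
# with a smooth Q-function, all actions within delta of the optimum form a
# single contiguous interval around a*.
import numpy as np

a_star = 0.30                          # hypothetical optimal 1-D action
actions = np.linspace(-1.0, 1.0, 201)  # discretized action space (201 bins)
q = -(actions - a_star) ** 2           # toy Q(s, a), peaked at a*

delta = 0.01
mask = q.max() - q <= delta            # Q(s, a*) - Q(s, a) <= delta
fan = actions[mask]

print(f"FAN spans [{fan.min():.2f}, {fan.max():.2f}], {mask.sum()} of 201 bins")
```

A one-hot cross-entropy target treats every bin in this interval except one as equally wrong, which is exactly the mismatch the paper targets.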

Two concrete failure modes result from ignoring FAN:

  1. SFT overfitting: with small task-specific datasets (e.g., 16K trajectories), the policy collapses probability mass onto a single demonstrated action bin, producing “spiky” distributions and poor OOD generalization. OpenVLA + SFT achieves only 78.1% in-distribution success on ManiSkill vs. 89.8% with the proposed method.
  2. RFT sample inefficiency: PPO/GRPO must implicitly discover action tolerance through exploration, requiring ~2.5× more training steps to reach a 90% success rate compared to the proposed FAN-PPO (249 vs. 98 steps for OpenVLA on ManiSkill).

Prior regularization approaches are inadequate: label smoothing (ε = 0.05) provides only modest gains (+4.7% IND, +2.0% avg OOD on ManiSkill) and degrades at higher ε; entropy maximization is unstructured and promotes exploration rather than modeling the local geometry of action tolerance.

architecture

FAN introduces no architectural changes — it is a pure loss-level regularizer compatible with any autoregressive VLA backbone. Experiments use two VLA models:

  • OpenVLA [Kim et al., 2024]: SigLIP + DINOv2 dual visual encoders fused with Llama2-7B, autoregressively predicting 7-DoF action tokens over discretized bins. Outputs a single action per step.
  • OpenVLA-OFT [Kim et al., 2025]: Extended variant that outputs action chunks (8 steps open-loop), accepts third-person + wrist camera images and robot proprioceptive state.

Both are finetuned with LoRA (rank 32). All experiments on NVIDIA A100 80GB GPUs.

training

The core contribution is the FAN-guided regularizer, defined as the KL divergence between the policy $\pi_\theta$ and a target Gaussian $\mathcal{N}(\mu(s), \Sigma(s))$:

\[\mathcal{L}_{\text{FAN}} = \mathbb{E}_s\left[D_{\text{KL}}\!\left(\pi(\cdot \mid s) \,\|\, \mathcal{N}(\cdot \mid \mu(s), \Sigma(s))\right)\right]\]

where $\mu(s) = \arg\max_a \pi(a \mid s)$ is the policy’s own mode. The implementation differs between SFT and RFT:

FAN-SFT — adaptive covariance (note the sign: the KL term is subtracted inside the negated sum, so the total loss is NLL plus $\alpha$ times the KL penalty): \(\mathcal{L}_{\text{FAN-SFT}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{K^i-1}\left(\log \pi_\theta(a_t^i \mid s_t^i, l^i) - \alpha\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid s_t^i, l^i) \,\|\, \mathcal{N}(\cdot \mid \mu, \Sigma)\right)\right)\)

The covariance is dynamically set to the policy’s own variance: $\Sigma(s) = \operatorname{diag}\!\left(\sum_{a \in A} \pi(a \mid s, l)(a - \mu(s))^2\right)$.
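
A minimal PyTorch sketch of the adaptive-covariance regularizer, assuming actions are discretized into 256 uniform bins on $[-1, 1]$; the function names, bin layout, and the discretized-Gaussian target construction are our assumptions, not a released implementation:

```python
# Sketch of the FAN-SFT regularizer (our reconstruction, not the paper's code).
# Assumption: `logits` has shape (..., n_bins), one categorical per action dim.
import torch
import torch.nn.functional as F

def fan_sft_kl(logits: torch.Tensor, n_bins: int = 256) -> torch.Tensor:
    """KL(pi || N(mu, Sigma)): Gaussian centered on the policy's own mode,
    with variance set adaptively to the policy's own variance."""
    bins = torch.linspace(-1.0, 1.0, n_bins, device=logits.device)  # bin centers
    probs = logits.softmax(-1)
    mu = bins[probs.argmax(-1, keepdim=True)]                # mode, shape (..., 1)
    var = (probs * (bins - mu) ** 2).sum(-1, keepdim=True)   # adaptive Sigma
    # Discretized Gaussian target over the same bins, renormalized via softmax.
    log_target = (-(bins - mu) ** 2 / (2 * var.clamp_min(1e-6))).log_softmax(-1)
    return (probs * (probs.clamp_min(1e-12).log() - log_target)).sum(-1).mean()

def fan_sft_loss(logits, target_tokens, alpha=0.05):
    """NLL on the demonstrated action token plus the FAN KL term."""
    nll = F.cross_entropy(logits.flatten(0, -2), target_tokens.flatten())
    return nll + alpha * fan_sft_kl(logits)
```

Because the target is centered on the policy's own mode, the term only reshapes the local geometry of the distribution; it never fights the NLL over where the mode should be.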

FAN-PPO — fixed covariance $\Sigma = \sigma^2 I$ for training stability: \(\mathcal{L}_{\text{FAN-PPO}}(\theta) = -\frac{1}{K}\sum_{k=0}^{K-1}\left[\min\!\left(\hat{I}_t^k \hat{A}_t^k,\, \text{Clip}(\hat{I}_t^k, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_t^k\right) - \alpha\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid s_k, l) \,\|\, \mathcal{N}(\cdot \mid \mu(s_k), \Sigma)\right)\right]\)
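
A corresponding sketch of the FAN-PPO loss, again assuming 256 uniform action bins on $[-1, 1]$; all names and shapes are our assumptions:

```python
# Sketch of the FAN-PPO loss (our reconstruction): clipped PPO surrogate plus
# a KL penalty toward a fixed-covariance Gaussian centered on the policy mode.
import torch

def fan_ppo_loss(logits, actions, old_log_probs, adv,
                 alpha=1.0, sigma=0.3, eps=0.2):
    n_bins = logits.shape[-1]
    bins = torch.linspace(-1.0, 1.0, n_bins, device=logits.device)
    log_pi = logits.log_softmax(-1)                                  # (B, n_bins)
    lp_taken = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B,)
    ratio = (lp_taken - old_log_probs).exp()                         # importance ratio
    surr = torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
    # Fixed-width Gaussian target (sigma^2 * I) centered on the current mode.
    mu = bins[log_pi.argmax(-1, keepdim=True)]
    log_target = (-(bins - mu) ** 2 / (2 * sigma ** 2)).log_softmax(-1)
    kl = (log_pi.exp() * (log_pi - log_target)).sum(-1)
    return -(surr - alpha * kl).mean()
```

Unlike an entropy bonus, the penalty pulls probability mass toward a fixed-width bump around the current mode rather than toward the uniform distribution.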

The optimal policy has a closed form (Proposition 1): $\pi_{t+1}(a \mid s, l) \propto \mathcal{N}(a \mid \mu, \Sigma)^{\frac{\alpha}{\alpha+\beta}}\, \pi_t(a \mid s, l)^{\frac{\beta}{\alpha+\beta}} \exp\!\left(\frac{Q^{\pi_t}(s,a,l)}{\alpha+\beta}\right)$, revealing a geometric interpolation between the target Gaussian and the previous policy, re-weighted by Q-values.
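
This closed form follows from a standard KL-regularized policy-improvement step. A sketch, assuming (our reconstruction, not quoted from the paper) that the per-step objective combines expected return with the FAN penalty (weight $\alpha$) and a trust-region KL to the previous policy (weight $\beta$):

```latex
% Assumed per-step objective (reconstruction):
\max_{\pi}\;\; \mathbb{E}_{a \sim \pi}\!\left[Q^{\pi_t}(s,a,l)\right]
  - \alpha\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \mathcal{N}(\cdot \mid \mu,\Sigma)\right)
  - \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_t\right)
% Stationarity under the constraint \sum_a \pi(a \mid s, l) = 1 gives
(\alpha+\beta)\,\log \pi_{t+1}(a \mid s, l)
  = Q^{\pi_t}(s,a,l) + \alpha \log \mathcal{N}(a \mid \mu,\Sigma)
  + \beta \log \pi_t(a \mid s, l) + \text{const}
% Exponentiating and normalizing recovers the stated closed form.
```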

Key hyperparameters:

| Setting | Model | α | σ | Notes |
|---|---|---|---|---|
| SFT ManiSkill | OpenVLA | 0.05 | adaptive | 4×A100, LR 5e-4, batch 40, 60K steps, 16K demos |
| SFT LIBERO | OpenVLA | 0.01 | adaptive | 2×A100, LR 5e-4, batch 48 |
| SFT LIBERO | OpenVLA-OFT | 0.05 | adaptive | 4×A100, LR 5e-4, batch 32, chunk size 8 |
| RFT ManiSkill | OpenVLA | 1.0 | 0.3 | 1×A100, 390 episodes, 64 traj/ep, max 80 steps |
| RFT ManiSkill | OpenVLA-OFT | 0.1 | 0.2 | 1×A100, 650 episodes, 96 traj/ep, max 80 steps |

Sensitivity: α = 0.01–0.1 works well for SFT; α > 2.0 destabilizes RFT. σ ∈ [0.1, 2.0] yields similar RFT performance; σ < 0.05 causes collapse.

evaluation

ManiSkill — SFT (PutOnPlateInScene25Main-v3, 25 pick-and-place tasks, 15 OOD variants):

| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + SFT | 78.1 | 76.6 | 57.4 | 40.4 | 58.1 |
| OpenVLA + FAN-SFT | 89.8 | 81.7 | 63.5 | 44.8 | 63.3 |
| Δ | +11.7 | +5.1 | +6.1 | +4.4 | +5.2 |

Largest single-task gains: M-Obj. (OOD) +9.3%, Disturb Recep. +7.8%, Noise-s +7.2%.

ManiSkill — RFT:

| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + PPO | 95.9 | 80.1 | 79.7 | 85.8 | 81.9 |
| OpenVLA + FAN-PPO | 97.4 | 85.0 | 86.7 | 92.6 | 88.1 |
| OpenVLA-OFT + PPO | 92.3 | 84.9 | 49.0 | 55.9 | 63.3 |
| OpenVLA-OFT + FAN-PPO | 97.3 | 88.1 | 58.6 | 67.0 | 71.2 |

Sample efficiency: FAN-PPO reaches 90% rollout success in 98 steps vs. 249 for vanilla PPO on OpenVLA (~2.5× faster). For evaluation, reaches 70% in 109 steps vs. 279 (~2.6× faster).

LIBERO — SFT (4 suites):

| Method | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA + FAN-SFT | — | — | — | — | 87.2 |
| OpenVLA-OFT | 95.2 | 94.2 | 95.2 | 93.2 | 94.5 |
| OpenVLA-OFT + FAN-SFT | 98.8 | 96.6 | 97.0 | 95.2 | 96.9 |

FAN-SFT on OpenVLA-OFT surpasses UniVLA (95.2%) by +1.7% overall.

Real-World (JAKA 7-DoF + RealSense D455, 150 demos, 30 trials/task):

| Method | Task-1 (IND) | Task-2 (obj pose) | Task-3 (robot pose) | Task-4 (box pos) |
|---|---|---|---|---|
| OpenVLA + SFT | 19/30 | 7/30 | 7/30 | 1/30 |
| OpenVLA + FAN-SFT | 22/30 | 12/30 | 17/30 | 7/30 |

reproduction guide

  1. Environment setup: Install ManiSkill3 (GPU-parallelized simulation) and LIBERO benchmark. All experiments on NVIDIA A100 80GB.

  2. Base checkpoints:
    • OpenVLA (SFT warmup): huggingface.co/gen-robot/openvla-7b-rlvla-warmup
    • OpenVLA (original): huggingface.co/openvla/openvla-7b
    • OpenVLA-OFT: huggingface.co/RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora
  3. FAN-SFT on ManiSkill:
    • Collect 16K demonstrations via ManiSkill motion planner on PutOnPlateInScene25Main-v3
    • 4×A100, LR 5e-4, batch 40, LoRA rank 32, input 224×224 px
    • Train 60K steps, add $\alpha\, D_{\text{KL}}(\pi \,\|\, \mathcal{N}(\mu, \Sigma))$ to NLL loss with α = 0.05
    • Adaptive covariance: $\Sigma = \text{diag}(\text{Var}_a[a])$ computed from policy distribution
  4. FAN-PPO on ManiSkill:
    • Start from SFT-warmup checkpoint
    • 1×A100, policy LR 1e-4, value LR 3e-3, mini-batch 8 (OpenVLA) or 12 (OpenVLA-OFT)
    • PPO: GAE λ = 0.95, clip ε = 0.2, entropy coeff = 0.0, 1 training epoch per episode
    • OpenVLA: α = 1.0, σ = 0.3, 390 episodes × 64 trajectories, max 80 steps/trajectory
    • OpenVLA-OFT: α = 0.1, σ = 0.2, 650 episodes × 96 trajectories, max 80 steps/trajectory
    • Fixed covariance $\Sigma = \sigma^2 I$
  5. Evaluation: 15 OOD variants (5 vision, 8 semantic, 3 execution). Report success rates averaged over multiple seeds.

  6. Tip: Start with the FAN-SFT implementation — it requires no environment interaction and is simpler to debug. The regularizer is just a KL divergence term added to the existing loss, computed between the policy logits and a Gaussian centered on the argmax action.

notes

  • The method is fundamentally a structured prior on the action distribution geometry, not entropy regularization. The Gaussian target encodes unimodality + smoothness + local contiguity — properties of physical FANs.
  • The distinction between adaptive covariance (SFT) and fixed covariance (RFT) is a practical stability choice, not a theoretical requirement. SFT’s supervised signal stabilizes the adaptive target; RFT needs the anchor of a fixed shape.
  • Compared to label smoothing: FAN provides structured geometry-aware regularization. Label smoothing at best ε = 0.05 yields 82.8% IND vs. FAN’s 89.8% on ManiSkill SFT.
  • Compared to entropy maximization: EM is unstructured and less sample-efficient. FAN-PPO converges faster and is less sensitive to hyperparameter choice.
  • A Gaussian-kernel-smoothed (multi-modal) target also improves over the baseline but underperforms the unimodal Gaussian, suggesting the specific regularizer design matters.
  • The FAN concept parallels observations in motor control literature — human movements exhibit invariant features within equivalence classes. The paper draws an explicit connection in the discussion.
  • No code is released yet. The implementation is straightforward (~10 lines of PyTorch per loss modification) given an existing PPO/SFT training loop for VLAs.
  • Applicable beyond OpenVLA: the regularizer is model-agnostic and could benefit other autoregressive VLAs (π0 with continuous actions would need adaptation), diffusion-based policies (π0, π0.5), or VQ-VLA.
  • Limitations: fixed σ in RFT requires tuning per model; the Gaussian unimodal assumption may not hold for all tasks (e.g., multi-path manipulation); experiments are limited to pick-and-place primitives.