2026-04-05

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Haochen Niu, Kanyu Zhang, Shuyu Yin, Qinghai Guo, Peilin Liu, Fei Wen et al.

VLA action-prior RL-finetuning

problem

VLA finetuning inherits language-style training objectives (one-hot cross-entropy, or RL analogues such as PPO and GRPO) that assume a single correct token at each step. This ignores a fundamental property of physical manipulation: feasible action neighborhoods (FAN) — for any state $s$, there exists a connected set of neighboring actions around $a^*(s)$ that yield near-identical task progress. This is formalized as:

\[\mathbb{N}_\delta(s) \subseteq \left\{a \in A : Q(s, a^*(s)) - Q(s, a) \leq \delta \right\}\]
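
As a toy illustration of this definition (entirely our construction, with a hypothetical 1-D Q-function; none of these names come from the paper), the set of actions within $\delta$ of the optimum forms one contiguous interval around $a^*$:

```python
# Toy illustration of a feasible action neighborhood (not from the paper's code):
# with a smooth Q-function, all actions within delta of the optimum form a
# single contiguous interval around a*.
import numpy as np

a_star = 0.30                          # hypothetical optimal 1-D action
actions = np.linspace(-1.0, 1.0, 201)  # discretized action space (201 bins)
q = -(actions - a_star) ** 2           # toy Q(s, a), peaked at a*

delta = 0.01
mask = q.max() - q <= delta            # Q(s, a*) - Q(s, a) <= delta
fan = actions[mask]

print(f"FAN spans [{fan.min():.2f}, {fan.max():.2f}], {mask.sum()} of 201 bins")
```

A one-hot cross-entropy target treats every bin in this interval except one as equally wrong, which is exactly the mismatch the paper targets.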

Two concrete failure modes result from ignoring FAN:

  1. SFT overfitting: with small task-specific datasets (e.g., 16K trajectories), the policy collapses probability mass onto a single demonstrated action bin, producing “spiky” distributions and poor OOD generalization. OpenVLA + SFT achieves only 78.1% in-distribution success on ManiSkill vs. 89.8% with the proposed method.
  2. RFT sample inefficiency: PPO/GRPO must implicitly discover action tolerance through exploration, requiring ~2.5× more training steps to reach a 90% success rate compared to the proposed FAN-PPO (249 vs. 98 steps for OpenVLA on ManiSkill).

Prior regularization approaches are inadequate: label smoothing (ε = 0.05) provides only modest gains (+4.7% IND, +2.0% avg OOD on ManiSkill) and degrades at higher ε; entropy maximization is unstructured and promotes exploration rather than modeling the local geometry of action tolerance.

architecture

FAN introduces no architectural changes — it is a pure loss-level regularizer compatible with any autoregressive VLA backbone. Experiments use two VLA models:

  • OpenVLA [Kim et al., 2024]: SigLIP + DINOv2 dual visual encoders fused with Llama2-7B, autoregressively predicting 7-DoF action tokens over discretized bins. Outputs a single action per step.
  • OpenVLA-OFT [Kim et al., 2025]: Extended variant that outputs action chunks (8 steps open-loop), accepts third-person + wrist camera images and robot proprioceptive state.

Both are finetuned with LoRA (rank 32). All experiments on NVIDIA A100 80GB GPUs.

training

The core contribution is the FAN-guided regularizer, defined as the KL divergence between the policy $\pi_\theta$ and a target Gaussian $\mathcal{N}(\mu(s), \Sigma(s))$:

\[\mathcal{L}_{\text{FAN}} = \mathbb{E}_s\left[D_{\text{KL}}\!\left(\pi(\cdot \mid s) \,\|\, \mathcal{N}(\cdot \mid \mu(s), \Sigma(s))\right)\right]\]

where $\mu(s) = \arg\max_a \pi(a \mid s)$ is the policy’s own mode. The implementation differs between SFT and RFT:

FAN-SFT — adaptive covariance (note the sign: the KL term is subtracted inside the negated sum, so the total loss is NLL plus $\alpha$ times the KL penalty): \(\mathcal{L}_{\text{FAN-SFT}}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{K^i-1}\left(\log \pi_\theta(a_t^i \mid s_t^i, l^i) - \alpha\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid s_t^i, l^i) \,\|\, \mathcal{N}(\cdot \mid \mu, \Sigma)\right)\right)\)

The covariance is dynamically set to the policy’s own variance: $\Sigma(s) = \operatorname{diag}\!\left(\sum_{a \in A} \pi(a \mid s, l)(a - \mu(s))^2\right)$.
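
A minimal PyTorch sketch of the adaptive-covariance regularizer, assuming actions are discretized into 256 uniform bins on $[-1, 1]$; the function names, bin layout, and the discretized-Gaussian target construction are our assumptions, not a released implementation:

```python
# Sketch of the FAN-SFT regularizer (our reconstruction, not the paper's code).
# Assumption: `logits` has shape (..., n_bins), one categorical per action dim.
import torch
import torch.nn.functional as F

def fan_sft_kl(logits: torch.Tensor, n_bins: int = 256) -> torch.Tensor:
    """KL(pi || N(mu, Sigma)): Gaussian centered on the policy's own mode,
    with variance set adaptively to the policy's own variance."""
    bins = torch.linspace(-1.0, 1.0, n_bins, device=logits.device)  # bin centers
    probs = logits.softmax(-1)
    mu = bins[probs.argmax(-1, keepdim=True)]                # mode, shape (..., 1)
    var = (probs * (bins - mu) ** 2).sum(-1, keepdim=True)   # adaptive Sigma
    # Discretized Gaussian target over the same bins, renormalized via softmax.
    log_target = (-(bins - mu) ** 2 / (2 * var.clamp_min(1e-6))).log_softmax(-1)
    return (probs * (probs.clamp_min(1e-12).log() - log_target)).sum(-1).mean()

def fan_sft_loss(logits, target_tokens, alpha=0.05):
    """NLL on the demonstrated action token plus the FAN KL term."""
    nll = F.cross_entropy(logits.flatten(0, -2), target_tokens.flatten())
    return nll + alpha * fan_sft_kl(logits)
```

Because the target is centered on the policy's own mode, the term only reshapes the local geometry of the distribution; it never fights the NLL over where the mode should be.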

FAN-PPO — fixed covariance $\Sigma = \sigma^2 I$ for training stability: \(\mathcal{L}_{\text{FAN-PPO}}(\theta) = -\frac{1}{K}\sum_{k=0}^{K-1}\left[\min\!\left(\hat{I}_t^k \hat{A}_t^k,\, \text{Clip}(\hat{I}_t^k, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_t^k\right) - \alpha\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid s_k, l) \,\|\, \mathcal{N}(\cdot \mid \mu(s_k), \Sigma)\right)\right]\)
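
A corresponding sketch of the FAN-PPO loss, again assuming 256 uniform action bins on $[-1, 1]$; all names and shapes are our assumptions:

```python
# Sketch of the FAN-PPO loss (our reconstruction): clipped PPO surrogate plus
# a KL penalty toward a fixed-covariance Gaussian centered on the policy mode.
import torch

def fan_ppo_loss(logits, actions, old_log_probs, adv,
                 alpha=1.0, sigma=0.3, eps=0.2):
    n_bins = logits.shape[-1]
    bins = torch.linspace(-1.0, 1.0, n_bins, device=logits.device)
    log_pi = logits.log_softmax(-1)                                  # (B, n_bins)
    lp_taken = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B,)
    ratio = (lp_taken - old_log_probs).exp()                         # importance ratio
    surr = torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
    # Fixed-width Gaussian target (sigma^2 * I) centered on the current mode.
    mu = bins[log_pi.argmax(-1, keepdim=True)]
    log_target = (-(bins - mu) ** 2 / (2 * sigma ** 2)).log_softmax(-1)
    kl = (log_pi.exp() * (log_pi - log_target)).sum(-1)
    return -(surr - alpha * kl).mean()
```

Unlike an entropy bonus, the penalty pulls probability mass toward a fixed-width bump around the current mode rather than toward the uniform distribution.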

The optimal policy has a closed form (Proposition 1): $\pi_{t+1}(a \mid s, l) \propto \mathcal{N}(a \mid \mu, \Sigma)^{\frac{\alpha}{\alpha+\beta}}\, \pi_t(a \mid s, l)^{\frac{\beta}{\alpha+\beta}} \exp\!\left(\frac{Q^{\pi_t}(s,a,l)}{\alpha+\beta}\right)$, revealing a geometric interpolation between the target Gaussian and the previous policy, re-weighted by Q-values.
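
This closed form follows from a standard KL-regularized policy-improvement step. A sketch, assuming (our reconstruction, not quoted from the paper) that the per-step objective combines expected return with the FAN penalty (weight $\alpha$) and a trust-region KL to the previous policy (weight $\beta$):

```latex
% Assumed per-step objective (reconstruction):
\max_{\pi}\;\; \mathbb{E}_{a \sim \pi}\!\left[Q^{\pi_t}(s,a,l)\right]
  - \alpha\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \mathcal{N}(\cdot \mid \mu,\Sigma)\right)
  - \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_t\right)
% Stationarity under the constraint \sum_a \pi(a \mid s, l) = 1 gives
(\alpha+\beta)\,\log \pi_{t+1}(a \mid s, l)
  = Q^{\pi_t}(s,a,l) + \alpha \log \mathcal{N}(a \mid \mu,\Sigma)
  + \beta \log \pi_t(a \mid s, l) + \text{const}
% Exponentiating and normalizing recovers the stated closed form.
```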

Key hyperparameters:

| Setting | Model | α | σ | Notes |
|---|---|---|---|---|
| SFT ManiSkill | OpenVLA | 0.05 | adaptive | 4×A100, LR 5e-4, batch 40, 60K steps, 16K demos |
| SFT LIBERO | OpenVLA | 0.01 | adaptive | 2×A100, LR 5e-4, batch 48 |
| SFT LIBERO | OpenVLA-OFT | 0.05 | adaptive | 4×A100, LR 5e-4, batch 32, chunk size 8 |
| RFT ManiSkill | OpenVLA | 1.0 | 0.3 | 1×A100, 390 episodes, 64 traj/ep, max 80 steps |
| RFT ManiSkill | OpenVLA-OFT | 0.1 | 0.2 | 1×A100, 650 episodes, 96 traj/ep, max 80 steps |

Sensitivity: α = 0.01–0.1 works well for SFT; α > 2.0 destabilizes RFT. σ ∈ [0.1, 2.0] yields similar RFT performance; σ < 0.05 causes collapse.

evaluation

ManiSkill — SFT (PutOnPlateInScene25Main-v3, 25 pick-and-place tasks, 15 OOD variants):

| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + SFT | 78.1 | 76.6 | 57.4 | 40.4 | 58.1 |
| OpenVLA + FAN-SFT | 89.8 | 81.7 | 63.5 | 44.8 | 63.3 |
| Δ | +11.7 | +5.1 | +6.1 | +4.4 | +5.2 |

Largest single-task gains: M-Obj. (OOD) +9.3%, Disturb Recep. +7.8%, Noise-s +7.2%.

ManiSkill — RFT:

| Method | In-Dist | Vision OOD | Semantic OOD | Execution OOD | Avg OOD |
|---|---|---|---|---|---|
| OpenVLA + PPO | 95.9 | 80.1 | 79.7 | 85.8 | 81.9 |
| OpenVLA + FAN-PPO | 97.4 | 85.0 | 86.7 | 92.6 | 88.1 |
| OpenVLA-OFT + PPO | 92.3 | 84.9 | 49.0 | 55.9 | 63.3 |
| OpenVLA-OFT + FAN-PPO | 97.3 | 88.1 | 58.6 | 67.0 | 71.2 |

Sample efficiency: FAN-PPO reaches 90% rollout success in 98 steps vs. 249 for vanilla PPO on OpenVLA (~2.5× faster). For evaluation, reaches 70% in 109 steps vs. 279 (~2.6× faster).

LIBERO — SFT (4 suites):

| Method | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA + FAN-SFT | — | — | — | — | 87.2 |
| OpenVLA-OFT | 95.2 | 94.2 | 95.2 | 93.2 | 94.5 |
| OpenVLA-OFT + FAN-SFT | 98.8 | 96.6 | 97.0 | 95.2 | 96.9 |

FAN-SFT on OpenVLA-OFT surpasses UniVLA (95.2%) by +1.7% overall.

Real-World (JAKA 7-DoF + RealSense D455, 150 demos, 30 trials/task):

| Method | Task-1 (IND) | Task-2 (obj pose) | Task-3 (robot pose) | Task-4 (box pos) |
|---|---|---|---|---|
| OpenVLA + SFT | 19/30 | 7/30 | 7/30 | 1/30 |
| OpenVLA + FAN-SFT | 22/30 | 12/30 | 17/30 | 7/30 |

reproduction guide

  1. Environment setup: Install ManiSkill3 (GPU-parallelized simulation) and LIBERO benchmark. All experiments on NVIDIA A100 80GB.

  2. Base checkpoints:
    • OpenVLA (SFT warmup): huggingface.co/gen-robot/openvla-7b-rlvla-warmup
    • OpenVLA (original): huggingface.co/openvla/openvla-7b
    • OpenVLA-OFT: huggingface.co/RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora
  3. FAN-SFT on ManiSkill:
    • Collect 16K demonstrations via ManiSkill motion planner on PutOnPlateInScene25Main-v3
    • 4×A100, LR 5e-4, batch 40, LoRA rank 32, input 224×224 px
    • Train 60K steps, add $\alpha\, D_{\text{KL}}(\pi \,\|\, \mathcal{N}(\mu, \Sigma))$ to NLL loss with α = 0.05
    • Adaptive covariance: $\Sigma = \text{diag}(\text{Var}_a[a])$ computed from policy distribution
  4. FAN-PPO on ManiSkill:
    • Start from SFT-warmup checkpoint
    • 1×A100, policy LR 1e-4, value LR 3e-3, mini-batch 8 (OpenVLA) or 12 (OpenVLA-OFT)
    • PPO: GAE λ = 0.95, clip ε = 0.2, entropy coeff = 0.0, 1 training epoch per episode
    • OpenVLA: α = 1.0, σ = 0.3, 390 episodes × 64 trajectories, max 80 steps/trajectory
    • OpenVLA-OFT: α = 0.1, σ = 0.2, 650 episodes × 96 trajectories, max 80 steps/trajectory
    • Fixed covariance $\Sigma = \sigma^2 I$
  5. Evaluation: 15 OOD variants (5 vision, 8 semantic, 3 execution). Report success rates averaged over multiple seeds.

  6. Tip: Start with the FAN-SFT implementation — it requires no environment interaction and is simpler to debug. The regularizer is just a KL divergence term added to the existing loss, computed between the policy logits and a Gaussian centered on the argmax action.

notes

  • The method is fundamentally a structured prior on the action distribution geometry, not entropy regularization. The Gaussian target encodes unimodality + smoothness + local contiguity — properties of physical FANs.
  • The distinction between adaptive covariance (SFT) and fixed covariance (RFT) is a practical stability choice, not a theoretical requirement. SFT’s supervised signal stabilizes the adaptive target; RFT needs the anchor of a fixed shape.
  • Compared to label smoothing: FAN provides structured geometry-aware regularization. Label smoothing at best ε = 0.05 yields 82.8% IND vs. FAN’s 89.8% on ManiSkill SFT.
  • Compared to entropy maximization: EM is unstructured and less sample-efficient. FAN-PPO converges faster and is less sensitive to hyperparameter choice.
  • A Gaussian-kernel-smoothed (multi-modal) target also improves over the baseline but underperforms the unimodal Gaussian, suggesting the specific regularizer design matters.
  • The FAN concept parallels observations in motor control literature — human movements exhibit invariant features within equivalence classes. The paper draws an explicit connection in the discussion.
  • No code is released yet. The implementation is straightforward (~10 lines of PyTorch per loss modification) given an existing PPO/SFT training loop for VLAs.
  • Applicable beyond OpenVLA: the regularizer is model-agnostic and could benefit other autoregressive VLAs (π0 with continuous actions would need adaptation), diffusion-based policies (π0, π0.5), or VQ-VLA.
  • Limitations: fixed σ in RFT requires tuning per model; the Gaussian unimodal assumption may not hold for all tasks (e.g., multi-path manipulation); experiments are limited to pick-and-place primitives.