2026-03-30

FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions

Xirui Shi, Arya Ebrahimi, Yi Hu, Jun Jin

diffusion-policy movement-primitives consistency-models real-time-robotics

problem

diffusion policies for robot action generation face a tradeoff between expressiveness and speed. action-chunking diffusion policies (e.g. Diffusion Policy and its consistency-distilled variant ManiCM) predict only short action segments, which makes them purely reactive and unable to capture time-dependent motion profiles such as acceleration-deceleration behaviors. Movement Primitive Diffusion (MPD) captures this temporal structure by diffusing over ProDMP parameters, but its multi-step denoising is too slow for closed-loop control (168.6 ms per step). FODMP closes the gap by distilling MPD into a one-step consistency model that generates temporally structured movement primitives in 17.2 ms per step.

architecture

FODMP has two stages:

teacher: multi-step MPD. a network $E_\vartheta$ maps noisy action samples $\tilde{\tau}$, observations $o$, and diffusion step $t$ to ProDMP parameters $\tilde{\theta} = E_\vartheta(\tilde{\tau}, o, t)$. the ProDMP decoder $P_\Phi$ then converts parameters to full trajectories; the teacher needs 10-50 denoising steps. ProDMP trajectory: $y(t) = c_1 y_1(t) + c_2 y_2(t) + \Phi(t)^\top \theta$, where $\theta = [w, g]$ stacks the forcing-basis weights and the goal attractor, and $y_1, y_2$ are the complementary solutions whose coefficients $c_1, c_2$ are fixed by the initial position and velocity.
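the ProDMP read-out can be sketched in a few lines of numpy. everything below the function signature is illustrative: the Gaussian forcing basis, the critically damped closed forms for $y_1, y_2$, and the way the goal term is folded in are my placeholder choices, not the exact ProDMP formulation.

```python
import numpy as np

def prodmp_trajectory(theta, t, y0=0.0, yd0=0.0, alpha=25.0, n_basis=10):
    """Decode a ProDMP parameter vector into a trajectory y(t) (sketch).

    theta = [w (n_basis,), g] -- forcing weights and goal, matching
    theta = [w, g] in the notes. Basis functions, complementary
    solutions, and the goal term are illustrative assumptions.
    """
    w, g = theta[:-1], theta[-1]
    t = np.asarray(t, dtype=float)[:, None]              # (T, 1)

    # complementary solutions of a critically damped 2nd-order system;
    # y1(0)=1, y1'(0)=0 and y2(0)=0, y2'(0)=1, so c1=y0, c2=yd0
    y1 = np.exp(-0.5 * alpha * t) * (1.0 + 0.5 * alpha * t)
    y2 = t * np.exp(-0.5 * alpha * t)
    c1, c2 = y0, yd0

    # normalized Gaussian basis Phi(t) for the forcing term
    centers = np.linspace(0.0, 1.0, n_basis)             # (B,)
    phi = np.exp(-0.5 * ((t - centers) / 0.1) ** 2)      # (T, B)
    phi /= phi.sum(axis=1, keepdims=True)

    # goal attractor sketched as a smooth approach toward g
    goal_term = g * (1.0 - np.exp(-0.5 * alpha * t)).ravel()

    # y(t) = c1*y1(t) + c2*y2(t) + Phi(t)^T w + goal contribution
    return (c1 * y1 + c2 * y2).ravel() + phi @ w + goal_term
```

the point of the parameterization: a low-dimensional $\theta$ deterministically decodes into a smooth, boundary-condition-respecting trajectory, so diffusion only has to model $\theta$.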

student: one-step consistency model. consistency distillation transfers the teacher into a student $f_\phi$ that maps noisy ProDMP parameters directly to clean ones in a single step. the distillation loss enforces self-consistency along the teacher's probability-flow ODE (PF-ODE):

\[\mathcal{L}_{CD} = \mathbb{E}\left[\lambda(t_n) \, d\left(f_\phi(\theta_{n+k}, o, t_{n+k}),\ f_{\phi^-}(\hat{\theta}_n, o, t_n)\right)\right]\]

where $f_{\phi^-}$ is an EMA target network, $d(\cdot, \cdot)$ is a distance metric, and $\hat{\theta}_n$ is obtained by running the teacher's PF-ODE solver from $\theta_{n+k}$ back to step $t_n$. at inference: sample $\theta_T \sim \mathcal{N}(0, I)$, compute $\theta_0 = f_\phi(\theta_T, o, T)$ in a single forward pass, decode the trajectory via ProDMP, execute in a receding-horizon loop.
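one distillation update can be sketched with the networks and the teacher's ODE solver abstracted as callables. the squared-L2 choice for $d(\cdot,\cdot)$ and all function names are placeholders, not the paper's exact setup.

```python
import numpy as np

def cd_loss(f_student, f_target, ode_solve, theta_nk, o, t_nk, t_n, lam=1.0):
    """One consistency-distillation loss evaluation (sketch).

    f_student(theta, o, t) -> clean-parameter estimate (trainable phi)
    f_target(theta, o, t)  -> same network with EMA weights phi^-
    ode_solve(theta, o, t_from, t_to) -> hat_theta_n, one teacher
        PF-ODE solver step from t_{n+k} back to t_n
    """
    hat_theta_n = ode_solve(theta_nk, o, t_nk, t_n)   # teacher ODE step
    pred = f_student(theta_nk, o, t_nk)               # student at t_{n+k}
    target = f_target(hat_theta_n, o, t_n)            # EMA target at t_n
    # d(.,.) taken as squared L2 here; the paper's metric may differ
    return lam * np.mean((pred - target) ** 2)

def ema_update(phi_minus, phi, mu=0.999):
    """EMA target-network update: phi^- <- mu * phi^- + (1 - mu) * phi."""
    return mu * phi_minus + (1.0 - mu) * phi
```

self-consistency means the loss is zero exactly when the student maps any two points on the same PF-ODE trajectory to the same clean $\theta_0$.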

closed-loop control: at each control step, the policy samples noise, runs one forward pass through the student, decodes a full ProDMP trajectory, and executes it. re-planning happens at the next control step.
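the control loop above is simple enough to write out; `get_obs` and `execute` are hypothetical environment hooks, and the student/decoder are passed in as callables.

```python
import numpy as np

def control_loop(f_student, decode, get_obs, execute, n_steps=100,
                 param_dim=11, rng=None):
    """Receding-horizon execution with the one-step student (sketch).

    f_student(theta_T, o, T) -> clean ProDMP parameters in one pass
    decode(theta)            -> full trajectory (ProDMP read-out)
    get_obs() / execute(traj) -> placeholder environment hooks
    """
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(n_steps):
        o = get_obs()
        theta_T = rng.standard_normal(param_dim)  # theta_T ~ N(0, I)
        theta_0 = f_student(theta_T, o, 1.0)      # single forward pass
        traj = decode(theta_0)                    # ProDMP decode
        execute(traj)                             # re-plan next step
```

note the structural difference from action chunking: each step commits a full temporally structured trajectory, and re-planning simply overwrites it.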

training

  • teacher trained following MPD protocol with multi-step diffusion over ProDMP parameters
  • student distilled from teacher via consistency distillation with EMA target
  • 3 independent training seeds, checkpoints saved every 100 epochs, last 10 averaged
  • 350 demonstrations for real-world tasks (Push-T, Ball Catching)
  • velocity control for FODMP/MPD, position control for DP/ManiCM
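the seed/checkpoint protocol above can be made concrete with a small aggregation helper. assumption flagged: I read "last 10 averaged" as averaging evaluation results over the last 10 checkpoints (not averaging weights).

```python
import numpy as np

def report_success(success_by_seed_and_ckpt):
    """Aggregate eval results per the protocol sketched above:
    average the last 10 checkpoints within each seed, then report
    mean and std over the independent seeds.

    success_by_seed_and_ckpt: array-like of shape (n_seeds, n_ckpts),
    e.g. (3, n_ckpts) for 3 seeds with a checkpoint every 100 epochs.
    """
    arr = np.asarray(success_by_seed_and_ckpt, dtype=float)
    per_seed = arr[:, -10:].mean(axis=1)   # last-10-checkpoint average
    return per_seed.mean(), per_seed.std() # mean/spread across seeds
```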

evaluation

simulation (MetaWorld + ManiSkill, 11 tasks):

| method | easy | medium | hard | avg success | avg time (ms) |
|--------|------|--------|------|-------------|---------------|
| DP     | 99.3% | 41.0% | 10.1% | 50.1% | 119.7 |
| ManiCM | 79.2% | 18.9% |  5.2% | 34.4% |  16.2 |
| MPD    | 98.9% | 64.8% | 28.6% | 64.1% | 168.6 |
| FODMP  | 99.2% | 86.3% | 49.0% | 78.2% |  17.2 |

FODMP is ~7x faster than DP and ~10x faster than MPD, while achieving the highest average success rate, with the largest margins on medium and hard tasks (DP edges it by 0.1% on easy).
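the speed claims follow directly from the table:

```python
# per-step inference latency from the table above (ms)
latency_ms = {"DP": 119.7, "ManiCM": 16.2, "MPD": 168.6, "FODMP": 17.2}

speedup_vs_dp = latency_ms["DP"] / latency_ms["FODMP"]    # ~7.0x
speedup_vs_mpd = latency_ms["MPD"] / latency_ms["FODMP"]  # ~9.8x
control_rate_hz = 1000.0 / latency_ms["FODMP"]            # ~58 Hz achievable
```

at ~58 Hz, inference no longer bottlenecks a typical closed-loop control rate, which is what makes the dynamic tasks below feasible.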

real-world Push-T (Franka Panda, 350 demos): FODMP outperforms DP by 19.7%, ManiCM by 23.6%, MPD by 9.2%. superior data efficiency – reaches high performance with fewer demos.

real-world Ball Catching (Franka Panda, 350 demos, 3 difficulty levels): FODMP outperforms DP by 68.2%, ManiCM by 26.1%, MPD by 36.2%. only FODMP successfully catches fast-flying balls – baselines are too slow to react.

reproduction guide

  1. train a multi-step MPD teacher on target task demonstrations following the MPD protocol
  2. distill teacher into one-step student using consistency distillation (EMA target, skip interval k)
  3. at inference: sample noise, single forward pass, ProDMP decode, execute in receding-horizon loop
  4. expected: ~17 ms per step, suitable for real-time closed-loop control
  5. gotchas: teacher quality matters – distillation can’t exceed teacher performance. velocity vs position control must match training. the action decoder is modular and could plug into larger VLA architectures like $\pi_0$

notes

  • the key design insight: perform diffusion directly over ProDMP parameters (not over raw actions), then use consistency distillation for one-step inference. this decouples temporal structure (ProDMP) from generation speed (consistency model)
  • the ball catching result is striking – it’s a genuinely hard dynamic task where inference speed directly determines success. only FODMP works because it’s the only method fast enough AND temporally coherent
  • the modular action decoder design means FODMP could be dropped into any diffusion-based VLA as the “action expert,” replacing multi-step denoising with one-step generation
  • connects to the speed-vs-quality angle: FODMP doesn’t trade quality for speed, it gets both. the consistency distillation preserves teacher quality while achieving ManiCM-level latency