2026-03-30

FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions

Xirui Shi, Arya Ebrahimi, Yi Hu, Jun Jin

diffusion-policy movement-primitives consistency-models real-time-robotics

problem

diffusion policies for robot action generation face a tradeoff between expressiveness and speed. action-chunking diffusion policies (e.g. Diffusion Policy and its consistency-distilled variant ManiCM) predict only short action segments, which makes them purely reactive and unable to capture time-dependent motion profiles such as acceleration-deceleration behaviors. Movement Primitive Diffusion (MPD) captures this temporal structure by diffusing over ProDMP parameters, but its multi-step denoising is too slow for closed-loop control (168.6 ms per step). FODMP closes the gap by distilling MPD into a one-step consistency model that generates temporally structured movement primitives in 17.2 ms per step.

architecture

FODMP has two stages:

teacher: multi-step MPD. a network $E_\vartheta$ maps noisy action samples $\tilde{\tau}$, observations $o$, and diffusion step $t$ to ProDMP parameters $\tilde{\theta} = E_\vartheta(\tilde{\tau}, o, t)$. the ProDMP decoder $P_\Phi$ then converts parameters to full trajectories; the teacher needs 10-50 denoising steps. ProDMP trajectory: $y(t) = c_1 y_1(t) + c_2 y_2(t) + \Phi(t)^\top \theta$, where $\theta = [w, g]$ stacks the forcing-basis weights and the goal attractor, and $y_1, y_2$ are the complementary solutions whose coefficients $c_1, c_2$ are fixed by the initial position and velocity.
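the ProDMP read-out can be sketched in a few lines of numpy. everything below the function signature is illustrative: the Gaussian forcing basis, the critically damped closed forms for $y_1, y_2$, and the way the goal term is folded in are my placeholder choices, not the exact ProDMP formulation.

```python
import numpy as np

def prodmp_trajectory(theta, t, y0=0.0, yd0=0.0, alpha=25.0, n_basis=10):
    """Decode a ProDMP parameter vector into a trajectory y(t) (sketch).

    theta = [w (n_basis,), g] -- forcing weights and goal, matching
    theta = [w, g] in the notes. Basis functions, complementary
    solutions, and the goal term are illustrative assumptions.
    """
    w, g = theta[:-1], theta[-1]
    t = np.asarray(t, dtype=float)[:, None]              # (T, 1)

    # complementary solutions of a critically damped 2nd-order system;
    # y1(0)=1, y1'(0)=0 and y2(0)=0, y2'(0)=1, so c1=y0, c2=yd0
    y1 = np.exp(-0.5 * alpha * t) * (1.0 + 0.5 * alpha * t)
    y2 = t * np.exp(-0.5 * alpha * t)
    c1, c2 = y0, yd0

    # normalized Gaussian basis Phi(t) for the forcing term
    centers = np.linspace(0.0, 1.0, n_basis)             # (B,)
    phi = np.exp(-0.5 * ((t - centers) / 0.1) ** 2)      # (T, B)
    phi /= phi.sum(axis=1, keepdims=True)

    # goal attractor sketched as a smooth approach toward g
    goal_term = g * (1.0 - np.exp(-0.5 * alpha * t)).ravel()

    # y(t) = c1*y1(t) + c2*y2(t) + Phi(t)^T w + goal contribution
    return (c1 * y1 + c2 * y2).ravel() + phi @ w + goal_term
```

the point of the parameterization: a low-dimensional $\theta$ deterministically decodes into a smooth, boundary-condition-respecting trajectory, so diffusion only has to model $\theta$.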

student: one-step consistency model. consistency distillation transfers the teacher into a student $f_\phi$ that maps noisy ProDMP parameters directly to clean ones in a single step. the distillation loss enforces self-consistency along the teacher's probability-flow ODE (PF-ODE):

\[\mathcal{L}_{CD} = \mathbb{E}\left[\lambda(t_n) \, d\left(f_\phi(\theta_{n+k}, o, t_{n+k}),\ f_{\phi^-}(\hat{\theta}_n, o, t_n)\right)\right]\]

where $f_{\phi^-}$ is an EMA target network, $d(\cdot, \cdot)$ is a distance metric, and $\hat{\theta}_n$ is obtained by running the teacher's PF-ODE solver from $\theta_{n+k}$ back to step $t_n$. at inference: sample $\theta_T \sim \mathcal{N}(0, I)$, compute $\theta_0 = f_\phi(\theta_T, o, T)$ in a single forward pass, decode the trajectory via ProDMP, execute in a receding-horizon loop.
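one distillation update can be sketched with the networks and the teacher's ODE solver abstracted as callables. the squared-L2 choice for $d(\cdot,\cdot)$ and all function names are placeholders, not the paper's exact setup.

```python
import numpy as np

def cd_loss(f_student, f_target, ode_solve, theta_nk, o, t_nk, t_n, lam=1.0):
    """One consistency-distillation loss evaluation (sketch).

    f_student(theta, o, t) -> clean-parameter estimate (trainable phi)
    f_target(theta, o, t)  -> same network with EMA weights phi^-
    ode_solve(theta, o, t_from, t_to) -> hat_theta_n, one teacher
        PF-ODE solver step from t_{n+k} back to t_n
    """
    hat_theta_n = ode_solve(theta_nk, o, t_nk, t_n)   # teacher ODE step
    pred = f_student(theta_nk, o, t_nk)               # student at t_{n+k}
    target = f_target(hat_theta_n, o, t_n)            # EMA target at t_n
    # d(.,.) taken as squared L2 here; the paper's metric may differ
    return lam * np.mean((pred - target) ** 2)

def ema_update(phi_minus, phi, mu=0.999):
    """EMA target-network update: phi^- <- mu * phi^- + (1 - mu) * phi."""
    return mu * phi_minus + (1.0 - mu) * phi
```

self-consistency means the loss is zero exactly when the student maps any two points on the same PF-ODE trajectory to the same clean $\theta_0$.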

closed-loop control: at each control step, the policy samples noise, runs one forward pass through the student, decodes a full ProDMP trajectory, and executes it. re-planning happens at the next control step.
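the control loop above is simple enough to write out; `get_obs` and `execute` are hypothetical environment hooks, and the student/decoder are passed in as callables.

```python
import numpy as np

def control_loop(f_student, decode, get_obs, execute, n_steps=100,
                 param_dim=11, rng=None):
    """Receding-horizon execution with the one-step student (sketch).

    f_student(theta_T, o, T) -> clean ProDMP parameters in one pass
    decode(theta)            -> full trajectory (ProDMP read-out)
    get_obs() / execute(traj) -> placeholder environment hooks
    """
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(n_steps):
        o = get_obs()
        theta_T = rng.standard_normal(param_dim)  # theta_T ~ N(0, I)
        theta_0 = f_student(theta_T, o, 1.0)      # single forward pass
        traj = decode(theta_0)                    # ProDMP decode
        execute(traj)                             # re-plan next step
```

note the structural difference from action chunking: each step commits a full temporally structured trajectory, and re-planning simply overwrites it.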

training

  • teacher trained following MPD protocol with multi-step diffusion over ProDMP parameters
  • student distilled from teacher via consistency distillation with EMA target
  • 3 independent training seeds, checkpoints saved every 100 epochs, last 10 averaged
  • 350 demonstrations for real-world tasks (Push-T, Ball Catching)
  • velocity control for FODMP/MPD, position control for DP/ManiCM
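the seed/checkpoint protocol above can be made concrete with a small aggregation helper. assumption flagged: I read "last 10 averaged" as averaging evaluation results over the last 10 checkpoints (not averaging weights).

```python
import numpy as np

def report_success(success_by_seed_and_ckpt):
    """Aggregate eval results per the protocol sketched above:
    average the last 10 checkpoints within each seed, then report
    mean and std over the independent seeds.

    success_by_seed_and_ckpt: array-like of shape (n_seeds, n_ckpts),
    e.g. (3, n_ckpts) for 3 seeds with a checkpoint every 100 epochs.
    """
    arr = np.asarray(success_by_seed_and_ckpt, dtype=float)
    per_seed = arr[:, -10:].mean(axis=1)   # last-10-checkpoint average
    return per_seed.mean(), per_seed.std() # mean/spread across seeds
```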

evaluation

simulation (MetaWorld + ManiSkill, 11 tasks):

| method | easy | medium | hard | avg success | avg time (ms) |
|--------|------|--------|------|-------------|---------------|
| DP     | 99.3% | 41.0% | 10.1% | 50.1% | 119.7 |
| ManiCM | 79.2% | 18.9% |  5.2% | 34.4% |  16.2 |
| MPD    | 98.9% | 64.8% | 28.6% | 64.1% | 168.6 |
| FODMP  | 99.2% | 86.3% | 49.0% | 78.2% |  17.2 |

FODMP is ~7x faster than DP and ~10x faster than MPD, while achieving the highest average success rate, with the largest margins on medium and hard tasks (DP edges it by 0.1% on easy).
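the speed claims follow directly from the table:

```python
# per-step inference latency from the table above (ms)
latency_ms = {"DP": 119.7, "ManiCM": 16.2, "MPD": 168.6, "FODMP": 17.2}

speedup_vs_dp = latency_ms["DP"] / latency_ms["FODMP"]    # ~7.0x
speedup_vs_mpd = latency_ms["MPD"] / latency_ms["FODMP"]  # ~9.8x
control_rate_hz = 1000.0 / latency_ms["FODMP"]            # ~58 Hz achievable
```

at ~58 Hz, inference no longer bottlenecks a typical closed-loop control rate, which is what makes the dynamic tasks below feasible.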

real-world Push-T (Franka Panda, 350 demos): FODMP outperforms DP by 19.7%, ManiCM by 23.6%, MPD by 9.2%. superior data efficiency – reaches high performance with fewer demos.

real-world Ball Catching (Franka Panda, 350 demos, 3 difficulty levels): FODMP outperforms DP by 68.2%, ManiCM by 26.1%, MPD by 36.2%. only FODMP successfully catches fast-flying balls – baselines are too slow to react.

reproduction guide

  1. train a multi-step MPD teacher on target task demonstrations following the MPD protocol
  2. distill teacher into one-step student using consistency distillation (EMA target, skip interval k)
  3. at inference: sample noise, single forward pass, ProDMP decode, execute in receding-horizon loop
  4. expected: ~17 ms per step, suitable for real-time closed-loop control
  5. gotchas: teacher quality matters – distillation can’t exceed teacher performance. velocity vs position control must match training. the action decoder is modular and could plug into larger VLA architectures like $\pi_0$

notes

  • the key design insight: perform diffusion directly over ProDMP parameters (not over raw actions), then use consistency distillation for one-step inference. this decouples temporal structure (ProDMP) from generation speed (consistency model)
  • the ball catching result is striking – it’s a genuinely hard dynamic task where inference speed directly determines success. only FODMP works because it’s the only method fast enough AND temporally coherent
  • the modular action decoder design means FODMP could be dropped into any diffusion-based VLA as the “action expert,” replacing multi-step denoising with one-step generation
  • connects to the speed-vs-quality angle: FODMP doesn’t trade quality for speed, it gets both. the consistency distillation preserves teacher quality while achieving ManiCM-level latency