2026-03-30
FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions
Xirui Shi, Arya Ebrahimi, Yi Hu, Jun Jin
problem
diffusion policies for robot action generation face a tradeoff between expressiveness and speed. action-chunking diffusion policies (like Diffusion Policy, ManiCM) predict only short action segments, making them purely reactive and unable to capture time-dependent motion profiles like acceleration-deceleration behaviors. Movement Primitive Diffusion (MPD) captures temporal structure by diffusing over ProDMP parameters, but its multi-step denoising makes it too slow for closed-loop control (168.6 ms per step). FODMP closes this gap by distilling MPD into a one-step consistency model that generates temporally structured motion primitives at 17.2 ms per step.
architecture
FODMP has two stages:
teacher: multi-step MPD. a network $E_\vartheta$ maps noisy action samples $\tilde{\tau}$ and observations $o$ to ProDMP parameters $\tilde{\theta} = E_\vartheta(\tilde{\tau}, o, t)$. the ProDMP decoder $P_\Phi$ then converts parameters to full trajectories. the teacher requires 10-50 denoising steps. ProDMP trajectory: $y(t) = c_1 y_1(t) + c_2 y_2(t) + \Phi(t)^\top \theta$, where $\theta = [w, g]$ encodes forcing weights and goal attractor.
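the ProDMP decoding step can be sketched as a linear combination of two homogeneous ODE solutions (which absorb the initial position/velocity boundary conditions) plus a learned basis term. this is a simplified illustration, not the paper's decoder: the toy RBF basis, the critically damped exponentials, and the `alpha` gain are assumptions standing in for the basis functions real ProDMP obtains by integrating the spring-damper system.

```python
import numpy as np

def prodmp_decode(theta, y0, dy0, T=1.0, n_steps=100, alpha=25.0):
    """Sketch of y(t) = c1*y1(t) + c2*y2(t) + Phi(t)^T theta.

    theta = [w, g]: forcing weights w plus a goal weight g (last entry).
    The basis and homogeneous solutions here are toy stand-ins.
    """
    t = np.linspace(0.0, T, n_steps)
    n_basis = len(theta) - 1  # last entry of theta is the goal weight g

    # toy basis: normalized RBFs for the forcing weights plus a ramp
    # column for the goal attractor (assumption, not real ProDMP basis)
    centers = np.linspace(0.0, 1.0, n_basis)
    rbf = np.exp(-0.5 * ((t[:, None] / T - centers[None, :]) / 0.08) ** 2)
    rbf /= rbf.sum(axis=1, keepdims=True)
    Phi = np.hstack([rbf, (t / T)[:, None]])

    particular = Phi @ theta  # Phi(t)^T theta term

    # homogeneous solutions of a critically damped spring-damper system:
    # y1(0)=1, y1'(0)=0 and y2(0)=0, y2'(0)=1
    y1 = (1.0 + alpha * t) * np.exp(-alpha * t)
    y2 = t * np.exp(-alpha * t)

    # pick c1, c2 so the decoded trajectory satisfies y(0)=y0, y'(0)=dy0
    dp0 = (particular[1] - particular[0]) / (t[1] - t[0])
    c1 = y0 - particular[0]
    c2 = dy0 - dp0
    return t, c1 * y1 + c2 * y2 + particular
```

the point of the $c_1 y_1 + c_2 y_2$ terms is that re-planning can start each new trajectory exactly from the robot's current state, which is what makes receding-horizon execution smooth.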
student: one-step consistency model. consistency distillation compresses the teacher into a student $f_\phi$ that maps noisy parameters directly to clean ones in a single step. the distillation loss enforces self-consistency along the PF-ODE:
\[\mathcal{L}\_{CD} = \mathbb{E}\left[\lambda(t\_n) \, d\left(f\_\phi(\theta\_{n+k}, o, t\_{n+k}), \, f\_{\phi^-}(\hat{\theta}\_n, o, t\_n)\right)\right]\]
where $f_{\phi^-}$ is an EMA target network and $d(\cdot, \cdot)$ is a distance metric. at inference: sample $\theta_T \sim \mathcal{N}(0, I)$, compute $\theta_0 = f_\phi(\theta_T, o, T)$, decode the trajectory via ProDMP, and execute in a receding-horizon loop.
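one training sample of this loss can be sketched as below. the networks and the teacher's ODE solver are placeholder callables, and the variance-exploding forward process $\theta_t = \theta_0 + t\,\epsilon$, squared-L2 choice of $d$, and $\lambda(t_n)=1$ are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_distillation_loss(f_student, f_target, teacher_ode_step,
                                  theta0, obs, t_grid, k=1):
    """One CD training sample: self-consistency along the PF-ODE.

    f_student / f_target map (noisy theta, obs, t) -> predicted clean theta;
    teacher_ode_step runs the teacher's ODE solver from t_{n+k} back to t_n.
    """
    n = int(rng.integers(0, len(t_grid) - k))
    t_n, t_nk = t_grid[n], t_grid[n + k]

    # forward-diffuse the clean parameters to noise level t_{n+k}
    eps = rng.standard_normal(theta0.shape)
    theta_nk = theta0 + t_nk * eps

    # the teacher's solver moves the noisy sample k steps toward the data
    theta_n_hat = teacher_ode_step(theta_nk, obs, t_nk, t_n)

    # both points on the same ODE trajectory should map to the same clean
    # parameters; d is squared L2 here, lambda(t_n) = 1 (assumptions)
    pred = f_student(theta_nk, obs, t_nk)
    target = f_target(theta_n_hat, obs, t_n)  # EMA weights, stop-gradient
    return float(np.mean((pred - target) ** 2))
```

the target branch uses the EMA network with no gradient flow, which is what stabilizes distillation; the skip interval `k` trades off how far apart the two consistency points sit on the ODE.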
closed-loop control: at each control step, the policy samples noise, runs one forward pass through the student, decodes a full ProDMP trajectory, and executes it. re-planning happens at the next control step.
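the control loop above can be sketched as follows. the `policy`, `decode`, and `env` interfaces (`.observe()`, `.step()`, `param_dim`, the chunk length) are hypothetical names for illustration, not the paper's API.

```python
import numpy as np

def control_loop(policy, decode, env, horizon_steps=16, control_steps=50):
    """Receding-horizon execution with one-step generation (sketch).

    policy(obs, noise) -> ProDMP parameters in one forward pass;
    decode(theta)      -> dense action trajectory.
    """
    rng = np.random.default_rng(0)
    for _ in range(control_steps):
        obs = env.observe()
        noise = rng.standard_normal(policy.param_dim)  # theta_T ~ N(0, I)
        theta = policy(obs, noise)                     # single forward pass
        traj = decode(theta)                           # full MP trajectory
        # execute only the first chunk; re-plan at the next control step
        for action in traj[:horizon_steps]:
            env.step(action)
```

because the decoded primitive covers the full horizon but only its first chunk is executed, the policy stays reactive while each executed segment inherits the temporal structure of a complete motion primitive.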
training
- teacher trained following MPD protocol with multi-step diffusion over ProDMP parameters
- student distilled from teacher via consistency distillation with EMA target
- 3 independent training seeds, checkpoints saved every 100 epochs, last 10 averaged
- 350 demonstrations for real-world tasks (Push-T, Ball Catching)
- velocity control for FODMP/MPD, position control for DP/ManiCM
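the EMA target network used during distillation is a standard exponential moving average over the student's weights; a minimal sketch, noting that the decay rate is an assumption (these notes don't state the paper's value):

```python
def ema_update(target_params, online_params, decay=0.999):
    """Move the EMA target network f_{phi^-} toward the online student.

    Both arguments are dicts mapping parameter names to values; the
    decay of 0.999 is an illustrative default, not the paper's setting.
    """
    return {name: decay * target_params[name]
                  + (1.0 - decay) * online_params[name]
            for name in target_params}
```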
evaluation
simulation (MetaWorld + ManiSkill, 11 tasks):
| method | easy | medium | hard | avg success | avg time (ms) |
|---|---|---|---|---|---|
| DP | 99.3% | 41.0% | 10.1% | 50.1% | 119.7 |
| ManiCM | 79.2% | 18.9% | 5.2% | 34.4% | 16.2 |
| MPD | 98.9% | 64.8% | 28.6% | 64.1% | 168.6 |
| FODMP | 99.2% | 86.3% | 49.0% | 78.2% | 17.2 |
FODMP is ~7x faster than DP and ~10x faster than MPD while achieving the highest average success rate; it leads clearly on medium and hard tasks and is within 0.1% of DP on easy tasks.
real-world Push-T (Franka Panda, 350 demos): FODMP outperforms DP by 19.7%, ManiCM by 23.6%, MPD by 9.2%. superior data efficiency – reaches high performance with fewer demos.
real-world Ball Catching (Franka Panda, 350 demos, 3 difficulty levels): FODMP outperforms DP by 68.2%, ManiCM by 26.1%, MPD by 36.2%. only FODMP successfully catches fast-flying balls – baselines are too slow to react.
reproduction guide
- train a multi-step MPD teacher on target task demonstrations following the MPD protocol
- distill teacher into one-step student using consistency distillation (EMA target, skip interval k)
- at inference: sample noise, single forward pass, ProDMP decode, execute in receding-horizon loop
- expected: ~17 ms per step, suitable for real-time closed-loop control
- gotchas: teacher quality matters – distillation can't exceed teacher performance. the control mode at inference (velocity vs. position) must match training. the action decoder is modular and could plug into larger VLA architectures like $\pi_0$
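to check the ~17 ms/step claim on your own hardware, time the full per-control-step pipeline (sample noise, one student forward pass, ProDMP decode). `policy` and `decode` below are placeholder callables standing in for the trained student and decoder:

```python
import time
import numpy as np

def measure_latency_ms(policy, decode, obs, n_trials=100):
    """Average per-control-step latency of the one-step pipeline in ms."""
    rng = np.random.default_rng(0)
    start = time.perf_counter()
    for _ in range(n_trials):
        noise = rng.standard_normal(policy.param_dim)  # theta_T ~ N(0, I)
        decode(policy(obs, noise))                     # one pass + decode
    return (time.perf_counter() - start) / n_trials * 1e3
```

the reported 17.2 ms is on the paper's hardware; averaging over many trials smooths out warm-up and scheduler jitter.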
notes
- the key design insight: perform diffusion directly over ProDMP parameters (not over raw actions), then use consistency distillation for one-step inference. this decouples temporal structure (ProDMP) from generation speed (consistency model)
- the ball catching result is striking – it’s a genuinely hard dynamic task where inference speed directly determines success. only FODMP works because it’s the only method fast enough AND temporally coherent
- the modular action decoder design means FODMP could be dropped into any diffusion-based VLA as the “action expert,” replacing multi-step denoising with one-step generation
- connects to the speed-vs-quality angle: FODMP doesn’t trade quality for speed, it gets both. the consistency distillation preserves teacher quality while achieving ManiCM-level latency