2026-04-01
DADP: Domain Adaptive Diffusion Policy
Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang et al.
problem
learning-based policies are coupled to specific environments. performance degrades sharply in unseen transition dynamics (different friction, mass, joint damping). prior domain adaptation methods (CORRO, CaDM, Meta-DT, MetaDiffuser) either entangle static domain info with time-varying dynamical properties, or fail to fully leverage learned domain representations in the policy.
specific limitation of diffusion policies: starting from pure gaussian noise, the denoiser must reconstruct different domain-specific action modalities from every sampled point equally. standard conditioning (input concatenation) doesn’t bias the sampling toward domain-appropriate modes.
architecture
two-stage pipeline
stage 1: domain representation learning (lagged context dynamical prediction)
standard dynamical prediction uses adjacent context $\tau_t = (s_{t-H}, a_{t-H}, \ldots, s_{t-1}, a_{t-1})$, which entangles:
- static info $\xi$: domain-specific dynamics (gravity, friction) – desired
- varying info $\omega_t$: instantaneous dynamical properties (higher-order temporal derivatives) – undesired
solution: introduce temporal offset $\Delta t$. context from episode $i$, prediction target from episode $j$ in same domain (cross-episode prediction):
\[\hat{s}_{t+1} = f_\theta(s_t, a_t, z_{t-\Delta t}), \quad z_{t-\Delta t} = E_\phi(\tau_{t-\Delta t})\]
as $\Delta t \to \infty$, $I(\omega_t; z_{t-\Delta t} \mid s_t, a_t, \xi) \to 0$: the time-varying $\omega_t$ decorrelates from a sufficiently lagged context, while the static $\xi$ stays informative because it is time-invariant. practically: context is selected from another episode in the same domain.
loss: $\mathcal{L} = \beta_{\text{forward}} \cdot \|\hat{s} - s\|^2 + \beta_{\text{inverse}} \cdot \|\hat{a} - a\|^2$ with $\beta_{\text{forward}} = \beta_{\text{inverse}} = 1.0$.
context encoder: transformer with adaptive pooling. dim=256, 4 layers, 8 heads, history length $H=16$.
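the cross-episode sampling is easy to get wrong (context and target must come from *different* episodes of the same domain). a minimal numpy sketch of the training-pair construction — all array names and sizes here are illustrative toys, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset: 3 domains x 4 episodes x 20 steps (all sizes illustrative)
n_domains, n_episodes, T, s_dim, a_dim = 3, 4, 20, 5, 2
states = rng.normal(size=(n_domains, n_episodes, T, s_dim))
actions = rng.normal(size=(n_domains, n_episodes, T, a_dim))
H = 16  # context history length, as in the notes

def sample_training_pair(domain):
    """context tau from episode i, prediction target from a *different*
    episode j of the same domain -- the Delta t -> infinity limit."""
    i, j = rng.choice(n_episodes, size=2, replace=False)
    t_ctx = rng.integers(H, T)  # context window end in episode i
    tau = np.concatenate([states[domain, i, t_ctx - H:t_ctx],
                          actions[domain, i, t_ctx - H:t_ctx]], axis=-1)
    t = rng.integers(0, T - 1)  # target transition in episode j
    return tau, states[domain, j, t], actions[domain, j, t], states[domain, j, t + 1]

tau, s_t, a_t, s_next = sample_training_pair(domain=0)
print(tau.shape, s_t.shape)  # (16, 7) (5,)
```

the encoder $E_\phi$ then maps `tau` to $z$, and $f_\theta(s_t, a_t, z)$ is trained to predict `s_next`.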
stage 2: domain-aware diffusion injection
two-part injection into the diffusion prior:
- bias the prior – start from a domain-shifted initial distribution: \(x_K = \sqrt{\bar{\alpha}_K} \cdot (a_0 - z) + z + \sqrt{1 - \bar{\alpha}_K} \cdot \varepsilon\)
at step $K$ (where $\bar{\alpha}_K \approx 0$): $x_K \approx z + \varepsilon$, so the prior is a mixture of gaussians with peaks at domain-specific modalities instead of isotropic noise.
- reformulate the prediction target – predict a composite term instead of pure noise: \(\hat{\varepsilon} = \sqrt{1-\bar{\alpha}_k} \cdot \varepsilon + (1 - \bar{\alpha}_k) \cdot \lambda z\)
```mermaid
flowchart LR
    Ctx[context tau] --> Encoder[context encoder E_phi]
    Encoder --> Z[domain rep z]
    Z --> Prior[biased prior z + noise]
    Z --> Target[reformulated target]
    Prior --> Denoiser[DiT denoiser]
    Target --> Denoiser
    Denoiser --> Action[action trajectory]
```
diffusion policy: DiT backbone, dim=256, 6 planner layers, 8 heads, cosine noise schedule, 5 inference steps. guidance scale $\lambda = 0.1$.
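the two injection equations can be sanity-checked numerically. a sketch assuming a standard cosine $\bar{\alpha}$ schedule (the notes say DADP uses cosine, but the exact schedule constants here are an assumption); it verifies that the biased prior collapses to $z + \varepsilon$ at step $K$:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 5  # inference steps, matching the notes

def alpha_bar(k, s=0.008):
    # cosine schedule (assumed constants); alpha_bar(K) ~= 0, alpha_bar(0) ~= 1
    return np.cos((k / K + s) / (1 + s) * np.pi / 2) ** 2

a0 = rng.normal(size=4)   # clean action chunk
z = rng.normal(size=4)    # domain representation z = E_phi(tau)
eps = rng.normal(size=4)  # gaussian noise
lam = 0.1                 # guidance scale

# biased prior at the last step K
ab_K = alpha_bar(K)
x_K = np.sqrt(ab_K) * (a0 - z) + z + np.sqrt(1 - ab_K) * eps
print(np.allclose(x_K, z + eps))  # True: the prior is a gaussian centered at z

# reformulated prediction target at an intermediate step k
ab_k = alpha_bar(2)
target = np.sqrt(1 - ab_k) * eps + (1 - ab_k) * lam * z
```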
training
context encoder: batch 128, LR $3 \times 10^{-4}$, 10 epochs, training ratio 0.8.
policy: batch 256, LR $3 \times 10^{-4}$, cosine noise schedule.
| environment | iterations |
|---|---|
| Walker2d | 1,000,000 |
| Ant | 400,000 |
| Hopper | 400,000 |
| HalfCheetah | 100,000 |
| Adroit Relocate | 500,000 |
| Adroit Door | 100,000 |
25 domains per MuJoCo environment. SAC expert policies per parameter setting. Adroit data from ODRL (3 domains total).
evaluation
main results (5 seeds, mean $\pm$ std)
| environment | setting | Meta-DT | DADP |
|---|---|---|---|
| Walker2d | IID | 1304 $\pm$ 586 | 3999 $\pm$ 174 |
| Walker2d | OOD | 889 $\pm$ 579 | 2834 $\pm$ 285 |
| Ant | IID | 3045 $\pm$ 128 | 3052 $\pm$ 30 |
| Ant | OOD | 3187 $\pm$ 899 | 3485 $\pm$ 83 |
| Hopper | IID | 1140 $\pm$ 156 | 1631 $\pm$ 47 |
| Hopper | OOD | 1208 $\pm$ 99 | 1686 $\pm$ 47 |
| HalfCheetah | IID | 3978 $\pm$ 66 | 3978 $\pm$ 66 |
| HalfCheetah | OOD | 3174 $\pm$ 501 | 3001 $\pm$ 225 |
| Door | IID | 1283 $\pm$ 323 | 1428 $\pm$ 44 |
| Door | OOD | 1294 $\pm$ 228 | 1494 $\pm$ 81 |
DADP wins on 8/10 settings with consistently the lowest variance across seeds, and is the only method to outperform the expert on Hopper OOD (1686 > 1555). it matches Meta-DT on HalfCheetah IID and trails it by 173 on HalfCheetah OOD.
representation quality (linear probe accuracy)
| $\Delta t$ | Walker2d | HalfCheetah |
|---|---|---|
| 1 (standard) | 27.9% | 68.6% |
| 32 | 64.9% | 98.3% |
| $\infty$ (cross-episode) | 99.3% | 99.9% |
| supervised oracle | 99.8% | 99.9% |
cross-episode lagged context reaches near-supervised domain representation quality without any labels.
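the probe protocol can be illustrated on synthetic clusters standing in for encoder outputs; a least-squares linear probe onto one-hot labels is an assumed form here — the notes don't specify the probe's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic stand-ins for encoder outputs z = E_phi(tau): one well-separated
# cluster per domain (sizes and separation are illustrative)
n_domains, per_domain, d = 5, 40, 16
centers = rng.normal(scale=3.0, size=(n_domains, d))
Z = np.concatenate([c + 0.3 * rng.normal(size=(per_domain, d)) for c in centers])
y = np.repeat(np.arange(n_domains), per_domain)

# linear probe: least-squares regression onto one-hot domain labels,
# classify by argmax over the predicted label vector
X = np.hstack([Z, np.ones((len(Z), 1))])  # append bias column
Y = np.eye(n_domains)[y]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
acc = ((X @ W).argmax(axis=1) == y).mean()
print(f"probe accuracy: {acc:.1%}")
```

high probe accuracy means domain identity is linearly decodable from the frozen $z$, which is what the table above measures.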
reproduction guide
- generate MuJoCo domains: sample 25 parameter sets per env from Table 5 ranges. train SAC expert per setting. collect 100-300 episodes per domain.
- train context encoder: transformer (dim 256, 4 layers, 8 heads, $H=16$). cross-episode prediction with $\Delta t \to \infty$ (different episodes same domain). batch 128, LR 3e-4, 10 epochs.
- train DADP: DiT (dim 256, 6 layers). mixed gaussian prior + reformulated prediction target. batch 256, LR 3e-4, 5 DDIM steps.
- evaluate: compute $z = E_\phi(h_{\text{ctx}})$ from online context, initialize from the biased prior $x_K = z + \varepsilon$, denoise 5 steps with guidance scale $\lambda = 0.1$.
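the inference loop, sketched end-to-end with dummy stand-ins for $E_\phi$ and the DiT (both hypothetical; the biased prior $x_K = z + \varepsilon$ and the 5-step structure follow the notes, while the DDIM-style update under the shifted parameterization is my reconstruction):

```python
import numpy as np

rng = np.random.default_rng(3)
a_dim, horizon, K = 2, 8, 5

def alpha_bar(k, s=0.008):
    # cosine schedule (assumed constants), clipped away from 0/1 for safety
    return float(np.clip(np.cos((k / K + s) / (1 + s) * np.pi / 2) ** 2,
                         1e-6, 1 - 1e-6))

def encode_context(h_ctx):
    # stand-in for the context encoder E_phi (hypothetical)
    return np.tanh(h_ctx.mean(axis=0))  # (a_dim,)

def denoiser(x, k, z):
    # dummy stand-in for the trained DiT; the real model is trained to
    # predict the reformulated composite target from stage 2
    return x - z

h_ctx = rng.normal(size=(16, a_dim))       # online context window
z = encode_context(h_ctx)
x = z + rng.normal(size=(horizon, a_dim))  # biased prior: x_K = z + eps

for k in range(K, 0, -1):                  # 5 DDIM-style steps
    ab_k, ab_prev = alpha_bar(k), alpha_bar(k - 1)
    eps_hat = denoiser(x, k, z)
    # estimate (a_0 - z) from the shifted parameterization, then re-noise
    d_hat = (x - z - np.sqrt(1 - ab_k) * eps_hat) / np.sqrt(ab_k)
    x = np.sqrt(ab_prev) * d_hat + z + np.sqrt(1 - ab_prev) * eps_hat

action = x[0]  # execute the first action of the denoised trajectory
```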
no code released yet. compute: moderate. 10 epochs of context encoder training + 100K-1M policy iterations per environment. feasible on single GPU.
notes
the lagged context idea is information-theoretically clean and simple to implement. the key insight is that separating context temporally (cross-episode) automatically filters out time-varying properties while retaining static domain info. reaching 99.3% linear probe accuracy without labels is remarkable.
the biased prior for diffusion is a practical trick that could transfer to other conditional generation settings where you want to bias sampling toward known modes. the $\lambda=0.1$ guidance scale is small but effective.
limitation: only addresses stationary (time-invariant) dynamics. non-stationary environments (e.g., changing payload, wear) aren’t handled. the MuJoCo domains are relatively simple compared to real-world robot dynamics with contact, deformation, and unmodeled effects.
connection to existing notes: this reinforces inference-time-guidance-pattern-robotics – domain-aware diffusion injection is another form of inference-time intervention. connects to can-discrete-flow-matching-replace-ar-and-diffusion-in-vlas – DADP shows diffusion policies still have room for architectural improvement over naive conditioning.