2026-04-05
Universal Hypernetworks for Arbitrary Models
Xuanfeng Zhou
Problem
Conventional hypernetworks (Ha et al., 2016; Von Oswald et al., 2019; Chen & Wang, 2022; Hedlin et al., 2025) are architecturally coupled to their target base models. Their output space — the number of per-layer/per-chunk learned embeddings, specialized output heads, or overall output dimensionality — is tied to the base-network architecture. Concretely, for a chunked hypernetwork (Von Oswald et al., 2019) that partitions $N$ target parameters into $N_\text{chunks}$ chunks of size $c$, the total hypernetwork parameter count satisfies
\[N_H = N_{H,0} + oc + N_\text{chunks} \cdot d_\text{emb},\]
which by the AM–GM inequality yields $N_H = \Omega(\sqrt{N})$. Changing the target model family typically requires redesigning and retraining the hypernetwork from scratch.
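The square-root bound follows in one step: substituting $N_\text{chunks} = N/c$ and applying AM–GM to the two $c$-dependent terms (treating the head width $o$ and embedding size $d_\text{emb}$ as constants),

```latex
N_H \;=\; N_{H,0} + oc + \frac{N}{c}\, d_\text{emb}
    \;\ge\; N_{H,0} + 2\sqrt{o\, d_\text{emb}\, N}
    \;=\; \Omega\!\left(\sqrt{N}\right),
```

with equality at $c = \sqrt{d_\text{emb} N / o}$ — so no choice of chunk size escapes the scaling.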
Additional specific limitations of prior methods:
- Ha et al. (2016) — generates only convolution kernels (not all parameters); embedding count grows with model size.
- Von Oswald et al. (2019) — chunked generation with learned per-chunk embeddings that scale with target size.
- Chen & Wang (2022); Navon et al. (2020) — architecture-coupled output parameterization forces redesign when the target model changes.
- Zhou et al. (2024); Kim et al. (2026) — output space sized for the largest architecture in the family, leading to wasted outputs or ad-hoc masking/truncation for smaller models.
- Hedlin et al. (2025) — learned per-block embedding sets whose number grows with generated blocks.
- Knyazev et al. (2021, 2023) — generated weights underperform direct training and require fine-tuning before deployment.
Recursive hypernetwork generation has been rarely explored and is challenging due to scaling and initialization instability (Liao et al., 2023; Lutati & Wolf, 2021).
Architecture
UHN is a fixed-architecture generator $H_\theta$ that predicts each scalar weight $w_i$ of a base model with $N$ parameters from deterministic descriptors:
\[w_i = H_\theta\!\left(\mathbf{v}_i,\; \mathbf{s}_g,\; \{\mathbf{s}_{\ell,j}\}_{j=1}^{L},\; \mathbf{t}\right)\]
Key principle: all target-specificity moves into the conditioning inputs, not the generator architecture.
Descriptors (4 types)
| Descriptor | Dim | Content |
|---|---|---|
| Index $\mathbf{v}_i \in \mathbb{R}^{10}$ | 10 | layer idx, layer type, param type, out idx, in idx, kernel h/w idx, embedding idx, sequence idx, grid idx |
| Global structure $\mathbf{s}_g \in \mathbb{R}^{6}$ | 6 | model type, num layers, cnn stage num, num encoders, num structure/index freqs (recursive) |
| Local structure $\mathbf{s}_{\ell,j} \in \mathbb{R}^{21}$ | 21 | per-layer: bias/norm/shortcut/activation type, I/O sizes, dropout, group num, kernel size, num heads, grid size, spline order, etc. |
| Task $\mathbf{t} \in \mathbb{R}^{2}$ | 2 | task type, dataset type |
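A minimal sketch of how an index descriptor might be assembled. The 10 fields follow the table above, but the field order and the integer codes (e.g. which `layer_type` value means "linear") are illustrative assumptions, not the paper's exact encoding:

```python
def index_descriptor(layer_idx, layer_type, param_type,
                     out_idx, in_idx, kh_idx=0, kw_idx=0,
                     emb_idx=0, seq_idx=0, grid_idx=0):
    """Return the 10-dim index descriptor v_i for one scalar weight.

    Fields unused by a layer type (e.g. kernel indices for a linear
    layer) default to 0 — an assumption about the padding convention.
    """
    return [layer_idx, layer_type, param_type,
            out_idx, in_idx, kh_idx, kw_idx,
            emb_idx, seq_idx, grid_idx]

# Example: weight [3, 7] of the second layer, assuming layer_type=0
# and param_type=0 encode "linear" and "weight".
v = index_descriptor(layer_idx=1, layer_type=0, param_type=0,
                     out_idx=3, in_idx=7)
assert len(v) == 10
```

One such vector is produced per scalar parameter of the target model, so descriptor construction is purely deterministic bookkeeping — nothing here is learned.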
Stage 1: Encode descriptors
Each descriptor vector $x$ is attribute-wise normalized (standardizing a uniform distribution) to $\hat{x}$, then mapped via Gaussian Fourier features (Tancik et al., 2020):
\[\gamma_B(\hat{x}) = \left[\cos(B\hat{x})^\top,\; \sin(B\hat{x})^\top\right]^\top \in \mathbb{R}^{2m},\]
where $B \in \mathbb{R}^{m \times n}$ has entries sampled i.i.d. from $\mathcal{N}(0, \sigma^2)$ and kept fixed. For index descriptors: $\phi_i = \gamma_{B_v}(\hat{\mathbf{v}}_i)$; for per-layer task-structure descriptors $\mathbf{u}_j = [\mathbf{s}_g; \mathbf{t}; \mathbf{s}_{\ell,j}]$: $\psi_j = \gamma_{B_u}(\hat{\mathbf{u}}_j)$.
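Stage 1 fits in a few lines. A minimal NumPy sketch, assuming $m = F_v/2$ so the output width equals $F_v$ (the relationship between $m$ and the tabulated $F_v$ is an assumption):

```python
import numpy as np

def gaussian_fourier_features(x_hat, B):
    """gamma_B(x_hat) = [cos(B x_hat); sin(B x_hat)].

    B has i.i.d. N(0, sigma^2) entries, sampled once and kept fixed —
    it is a frozen buffer, not a trained parameter.
    """
    proj = B @ x_hat                                     # (m,)
    return np.concatenate([np.cos(proj), np.sin(proj)])  # (2m,)

rng = np.random.default_rng(0)
m, n, sigma = 512, 10, 100.0
B_v = sigma * rng.standard_normal((m, n))  # fixed projection for index descriptors
v_hat = rng.uniform(-1.0, 1.0, size=n)     # attribute-wise normalized descriptor
phi = gaussian_fourier_features(v_hat, B_v)
assert phi.shape == (2 * m,)
```

The large default $\sigma = 100$ spreads nearby integer indices across many Fourier periods, which is what lets a smooth MLP emit sharply different weights for adjacent indices.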
Stage 2: Map encodings to weights
Index branch — MLP with pre-activation residual blocks (He et al., 2016): input linear (width $d$) → 2 residual blocks (each: ReLU → LayerNorm → Linear with width $d$) → shortcut connections.
Task-structure encoder (optional) — single-layer Transformer encoder (Vaswani et al., 2017) with $h$ heads applied to $\{\psi_j\}_{j=1}^{L}$ → mean pooling → 2-layer MLP (Linear → ReLU → Linear, width $d$; last linear zero-initialized).
Fusion — add task-structure feature to index-branch representation (after residual blocks) → ReLU → final linear layer → scalar weight $w_i$.
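The index branch and fusion can be sketched as follows — a shape-level NumPy sketch with untrained random weights, omitting the Transformer task-structure encoder (its pooled output is passed in as a precomputed vector); the initialization scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m_v = 64, 512  # width d and index-feature half-width (illustrative)

relu = lambda z: np.maximum(z, 0.0)
def layer_norm(z, eps=1e-5):
    return (z - z.mean()) / np.sqrt(z.var() + eps)

# Untrained parameters, for shape illustration only (biases omitted).
W_in   = rng.standard_normal((d, 2 * m_v)) / np.sqrt(2 * m_v)
blocks = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2)]
W_out  = rng.standard_normal((1, d)) / np.sqrt(d)

def uhn_forward(phi_i, task_struct_feat=None):
    """Index branch -> additive fusion -> ReLU -> final linear -> w_i."""
    h = W_in @ phi_i                     # input linear, width d
    for W in blocks:                     # pre-activation residual blocks:
        h = h + W @ layer_norm(relu(h))  #   ReLU -> LayerNorm -> Linear, + shortcut
    if task_struct_feat is not None:     # fusion after the residual blocks
        h = h + task_struct_feat
    return float(W_out @ relu(h))        # scalar weight w_i

w_i = uhn_forward(rng.standard_normal(2 * m_v))
assert np.isfinite(w_i)
```

Because the task-structure feature enters additively and its last linear layer is zero-initialized, the encoder contributes exactly nothing at step 0 and the index branch alone determines early training.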
Default architecture hyperparameters
| Setting | $F_v$ | $\sigma$ | $d$ | $N_\text{blk}$ | $F_u$ | $h$ |
|---|---|---|---|---|---|---|
| MNIST (single-model) | 1024 | 100 | 64 | 2 | — | — |
| All other single-model | 2048 | 100 | 128 | 2 | — | — |
| Multi-model / Multi-task | 2048 | 100 | 128 | 2 | 32 | 4 |
| Recursive (generated UHNs) | 1024 | 100 | 64 | — | 32 | 4 |
The default UHN has 612,117 trainable parameters (non-MNIST single-model) or 158,613 (MNIST). Crucially, this count is independent of the target model size — unlike embedding-based baselines where $N_H = \Omega(\sqrt{N})$.
Supported base model types
UHN generates all trainable parameters for: Linear, 2D Convolution, GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), Embedding, Multi-head Attention (MHA), and KAN (Liu et al., 2024) layers.
Generation modes
- Single-model, single-task: fix architecture/task descriptors, vary only index descriptors.
- Multi-model, single-task: fix task $\mathbf{t}$, vary architecture descriptors across models.
- Multi-task: vary both task $\mathbf{t}$ and architecture descriptors.
- Recursive: treat UHN itself as a target model; chain $H_0 \to H_1 \to \cdots \to H_K \to f$.
Training
Unified procedure
Each iteration: (1) sample target specification (architecture + task), (2) generate parameters via UHN, (3) evaluate task loss of resulting base model, (4) backpropagate through the entire differentiable generation path to update root parameters $\theta$.
Initialization phase (optional but critical)
Before main training, match generated parameter statistics to standard initializations. For each generated component $g$ with empirical mean $\mu(g)$ and std $\sigma(g)$:
\[\mathcal{L}_\text{init} = \frac{1}{2|\mathcal{G}|} \sum_{g \in \mathcal{G}} \left[\left(\mu(g) - \mu^*\!(g)\right)^2 + \left(\sigma(g) - \sigma^*\!(g)\right)^2\right]\]
Default targets $\mu^*(g)$, $\sigma^*(g)$ match PyTorch/PyG initializations (e.g., linear weights: $\mu^*=0$, $\sigma^*=\frac{1}{\sqrt{3\,d_\text{in}}}$; GCN: $\sigma^*=\sqrt{\frac{2}{d_\text{in}+d_\text{out}}}$; KAN spline weights: $\mathcal{N}(0, 0.1^2)$).
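The statistics-matching loss is straightforward to compute; a minimal NumPy sketch (the component names and the grouping of parameters into components are illustrative assumptions):

```python
import numpy as np

def init_loss(components, targets):
    """L_init = (1 / 2|G|) * sum_g [(mu(g) - mu*)^2 + (sigma(g) - sigma*)^2].

    components: dict name -> generated parameter array
    targets:    dict name -> (mu*, sigma*) target statistics
    """
    terms = [(g.mean() - targets[name][0]) ** 2
             + (g.std() - targets[name][1]) ** 2
             for name, g in components.items()]
    return 0.5 * float(np.mean(terms))

rng = np.random.default_rng(2)
d_in = 128
# Hypothetical generated linear weight vs. PyTorch-style target stats.
components = {"fc.weight": rng.uniform(-0.1, 0.1, size=(64, d_in))}
targets = {"fc.weight": (0.0, 1.0 / np.sqrt(3 * d_in))}
loss = init_loss(components, targets)
assert loss >= 0.0
```

Because only the first two moments are matched, this phase shapes the generated distribution without pinning individual weights, leaving the main task loss free to determine them.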
For recursive chains, initialization proceeds top-down: allocate $S_\text{lvl}$ steps per level, with active level $k = \min(K, \lfloor\text{step}/S_\text{lvl}\rfloor)$. Disabling initialization causes training to diverge in recursive settings.
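The top-down level schedule is a one-liner; a sketch of the stated rule:

```python
def active_level(step, S_lvl, K):
    """Top-down recursive init: level being matched at `step`.

    Levels 0..K are initialized in order, S_lvl steps each; after
    K * S_lvl steps the schedule stays on the deepest level K.
    """
    return min(K, step // S_lvl)

# K=2 levels below the root, S_lvl=100 steps per level:
assert [active_level(s, 100, 2) for s in (0, 99, 100, 250)] == [0, 0, 1, 2]
```

Initializing the root first makes sense: each level's target statistics are only meaningful once the generator above it already emits well-scaled weights.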
Optimization
- Optimizer: AdamW (no weight decay), cosine LR schedule with warmup.
- Hardware: single NVIDIA RTX 4090, PyTorch, AMP (FP16).
- Batch size: 256 for classification tasks; full-batch for formula regression.
- Warmup: 5 linear warmup epochs (single-/multi-model); 1000 warmup steps (multi-task/recursive).
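The schedule above can be sketched as a step-to-LR function — linear warmup into cosine decay to zero, which is a common variant; the exact decay floor and warmup interpolation used here are assumptions:

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup for `warmup_steps`, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear ramp-up
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine decay

# e.g. multi-task setting: 1000 warmup steps out of 200000, base lr 2e-5
assert lr_at(0, 200_000, 2e-5, 1000) < lr_at(999, 200_000, 2e-5, 1000)
```

In PyTorch this is typically passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.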
Hyperparameter selection
Staged grid sweep: (1) fix training budget, sweep $\eta_\text{train}$ with no init; (2) fix training params, sweep $(S_\text{init}, \eta_\text{init})$; (3) fix init params, refine training params. Multi-task selection uses Borda count with per-task validation guardrails.
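The Borda-count selection with guardrails might look like the following sketch. It assumes every task's validation metric is higher-is-better (for kv RMSE one would negate the score or flip the guardrail comparison), and the tie-breaking rule is an assumption:

```python
def borda_select(scores, guardrails):
    """Pick a hyperparameter config by Borda count across tasks.

    scores:     dict config -> dict task -> validation score (higher is better)
    guardrails: dict task -> minimum acceptable score; configs that
                violate any guardrail are excluded before ranking.
    """
    configs = [c for c, s in scores.items()
               if all(s[t] >= g for t, g in guardrails.items())]
    points = {c: 0 for c in configs}
    for t in guardrails:
        # per-task ranking: the best config receives the most Borda points
        for pts, c in enumerate(sorted(configs, key=lambda c: scores[c][t])):
            points[c] += pts
    return max(points, key=points.get)

scores = {
    "A": {"mnist": 0.97, "cifar": 0.88},
    "B": {"mnist": 0.98, "cifar": 0.84},  # fails the cifar guardrail
    "C": {"mnist": 0.96, "cifar": 0.89},
}
best = borda_select(scores, {"mnist": 0.95, "cifar": 0.85})
assert best in ("A", "C")  # "B" is eliminated by the guardrail
```

The guardrails keep a config that excels on the heavily sampled tasks from being chosen at the cost of collapsing a lightly sampled one.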
Evaluation
Single-model universality (Table 1)
One fixed UHN (612,117 params except MNIST at 158,613) vs. direct training across 11 task–model pairs:
| Task | Dataset | Model | Acc. (Direct) | Acc. (UHN) |
|---|---|---|---|---|
| Image | MNIST | MLP | 0.9837 ± 0.0007 | 0.9841 ± 0.0006 |
| Image | MNIST | CNN-8 | 0.9938 ± 0.0004 | 0.9944 ± 0.0010 |
| Image | CIFAR-10 | CNN-20 | 0.8999 ± 0.0027 | 0.8993 ± 0.0016 |
| Graph | Cora | GCN | 0.8172 ± 0.0039 | 0.7950 ± 0.0069 |
| Graph | PubMed | GCN | 0.7691 ± 0.0028 | 0.7815 ± 0.0075 |
| Graph | Cora | GAT | 0.8132 ± 0.0094 | 0.7981 ± 0.0091 |
| Text | AG News | Transformer-2L | 0.9186 ± 0.0008 | 0.9099 ± 0.0005 |
| Text | IMDB | Transformer-1L | 0.8853 ± 0.0007 | 0.8638 ± 0.0005 |
UHN matches or exceeds direct training on 4 of 11 settings. On formula regression (15 KAN special functions), UHN matches or improves RMSE on 9 of 15 functions.
Scalability (Table 2)
Fixed UHN (612,117 params) generating CNN-20/32/44/56 on CIFAR-10 vs. baselines:
| Model | #Params (Direct) | #Params (HA) | #Params (Chunked) | Acc. (Direct) | Acc. (UHN) |
|---|---|---|---|---|---|
| CNN-20 | 269K | 619K | 625K | 0.8999 | 0.8993 |
| CNN-56 | 851K | 668K | 632K | 0.9043 | 0.9069 |
UHN parameter count stays fixed at 612,117 while HA grows from 619K to 668K and Chunked from 625K to 632K.
Multi-model generalization (Table 3)
One UHN (663,151 params) trained on a model family, tested on held-out architectures:
| Family | #Models | Max Params | Seen Acc. | Unseen Acc. |
|---|---|---|---|---|
| CNN Mixed Depth | 100 | 463K | 0.8842 ± 0.0031 | 0.8430 ± 0.0023 |
| CNN Mixed Width | 500 | 1.07M | 0.9145 ± 0.0014 | 0.9145 ± 0.0013 |
| CNN Mixed Depth × Width | 1000 | 1.37M | 0.9038 ± 0.0012 | 0.9040 ± 0.0016 |
| Transformer Mixed | 1000 | 1.09M | 0.9063 ± 0.0004 | 0.9066 ± 0.0002 |
Three of four families show nearly identical seen/unseen accuracy. CNN Mixed Depth has a larger gap (driven by a single unseen outlier deeper than all training models).
Multi-task (Table 4)
One shared UHN across 6 heterogeneous tasks (vision, graph, text, formula regression):
| Task | Model | Perf. (Direct) | Perf. (UHN Single) | Perf. (UHN Multi) |
|---|---|---|---|---|
| MNIST | MLP | 0.9837 | 0.9841 | 0.9786 |
| CIFAR-10 | CNN-44 | 0.9076 | 0.9043 | 0.8927 |
| Cora | GCN | 0.8172 | 0.7950 | 0.7930 |
| PubMed | GAT | 0.7700 | 0.7801 | 0.7697 |
| AG News | Transformer-2L | 0.9186 | 0.9099 | 0.9062 |
| kv (RMSE) | KAN-g5 | 0.0211 | 0.0104 | 0.0172 |
Task sampling probabilities: CIFAR-10 (0.55), AG News (0.18), kv (0.11), MNIST (0.08), Cora (0.04), PubMed (0.04).
Recursive generation (Table 5)
$H_0 \to H_1 \to \cdots \to H_K \to f$ on MNIST MLP:
| Depth $K$ | Accuracy |
|---|---|
| 0 (no recursion) | 0.9841 ± 0.0006 |
| $K=1$ | 0.9825 ± 0.0007 |
| $K=2$ | 0.9795 ± 0.0011 |
| $K=3$ | 0.9741 ± 0.0021 |
Stable up to $K=3$ intermediate UHNs; the gradual degradation is consistent with compounding approximation error along the generation chain.
Ablation highlights
- Index encoding (Table 23): Raw (0.6642) → Positional (0.8677) → GFF (0.8993). Gaussian Fourier features are essential.
- Capacity (Table 24): Increasing $F_v$ (256→4096: 0.8894→0.9018) and $d$ (32→256: 0.8900→0.9019) improve accuracy; depth beyond 1 block yields marginal gains.
- Task-structure encoder (Tables 25–26): Primarily stabilizes early training; marginal/mixed effect on final accuracy.
- Initialization (Tables 27–28): Improves convergence speed and stability; essential for recursive training (diverges without it).
Reproduction Guide
Environment
# Hardware: single NVIDIA RTX 4090
# Framework: PyTorch with AMP (FP16)
pip install torch torchvision torch_geometric
git clone https://github.com/Xuanfeng-Zhou/UHN.git
cd UHN
Single-model: CIFAR-10 CNN-20
# Train direct baseline (for comparison)
python train_direct.py --dataset cifar10 --model cnn20 \
--lr 0.005 --epochs 400 --warmup_epochs 5 \
--seed 0
# Train UHN (best hyperparams from Table 13: no init, lr=2e-4, 800 epochs)
python train_uhn.py --dataset cifar10 --model cnn20 \
--Fv 2048 --d 128 --Nblk 2 --sigma 100 \
--init_steps 0 --init_lr 0 \
--train_lr 2e-4 --train_epochs 800 \
--warmup_epochs 5 --batch_size 256 \
--seed 0
Multi-model: CNN Mixed Width
# Best hyperparams (Table 19): init_lr=1e-4, Sinit=12800, train_lr=1e-4, Etrain=3200
python train_uhn_multi.py --family cnn_mixed_width \
--Fu 32 --heads 4 \
--init_lr 1e-4 --init_steps 12800 \
--train_lr 1e-4 --train_epochs 3200 \
--warmup_epochs 5 --batch_size 256 \
--seed 0
Multi-task (6 tasks)
# Best hyperparams (Table 21): init_lr=1e-4, Sinit=500, train_lr=2e-5, Strain=200000
python train_uhn_multitask.py \
--tasks mnist.mlp,cifar10.cnn44,cora.gcn,pubmed.gat,agnews.transformer2l,kv.kang5 \
--Fu 32 --heads 4 \
--init_lr 1e-4 --init_steps 500 \
--train_lr 2e-5 --train_steps 200000 \
--warmup_steps 1000 --batch_size 256 \
--seed 0
Recursive (depth K=1)
# Best hyperparams (Table 22): init_lr=1e-4, Sinit=4000, train_lr=2e-5, Strain=30000
python train_uhn_recursive.py --depth 1 \
--init_lr 1e-4 --init_steps 4000 \
--train_lr 2e-5 --train_steps 30000 \
--warmup_steps 1000 --grad_clip 0.01 \
--seed 0
Key hyperparameter sweep grids
- $\eta_\text{init} \in \{5\times 10^{-5},\ 10^{-4},\ 2\times 10^{-4}\}$, $S_\text{init} \in \{50, 100, 200\}$ (single-model)
- $\eta_\text{train} \in \{2\times 10^{-5},\ 5\times 10^{-5},\ 10^{-4}\}$ (most settings)
- Multi-task guardrails: MNIST ≥ 0.95, CIFAR-10 ≥ 0.85, Cora ≥ 0.75, PubMed ≥ 0.75, AG News ≥ 0.85, kv RMSE ≤ 5e-2
Notes
- Core insight: By modeling weights as a function of deterministic descriptors (index + architecture + task) rather than using learned per-layer embeddings, UHN achieves architecture-agnostic weight generation with a fixed generator parameter count independent of target model size.
- Descriptor design is manual: The 10/6/21/2-dimensional descriptor layouts are hand-crafted for the supported layer types (Linear, Conv, GCN, GAT, Embedding, MHA, KAN). Extending to new layer types requires designing new descriptor fields.
- No non-trainable state generation: UHN does not generate BatchNorm running statistics, limiting compatibility with some architectures that rely on batch normalization.
- Compute overhead: Training through the full differentiable generation path adds optimization cost and memory usage; UHN typically needs longer training schedules than direct training (e.g., 800 epochs vs. 400 for CIFAR-10 CNN-20).
- Task-structure encoder is auxiliary: Ablations show its effect on final accuracy is small and task-dependent; UHN’s core generality comes from the shared index-based generator.
- Initialization is essential for recursion: Without the statistics-matching init phase, recursive training diverges due to exploding activations and unstable gradients in the generation chain.
- Scaling to larger models (beyond ~1.37M target params) and deeper recursion ($K > 3$) are noted as open challenges requiring depth-specific optimization choices.
- Cross-task interference in multi-task training (e.g., kv RMSE degrading from 0.0104 single-task to 0.0172 multi-task) suggests gradient surgery (Yu et al., 2020) or task-adaptive sampling as promising mitigations.