2026-04-05
Universal Hypernetworks for Arbitrary Models
Xuanfeng Zhou
Problem
Conventional hypernetworks (Ha et al., 2016; Von Oswald et al., 2019; Chen & Wang, 2022; Hedlin et al., 2025) are architecturally coupled to their target base models. Their output space — the number of per-layer/per-chunk learned embeddings, specialized output heads, or overall output dimensionality — is tied to the base-network architecture. Concretely, for a chunked hypernetwork (Von Oswald et al., 2019) that partitions $N$ target parameters into $N_\text{chunks}$ chunks of size $c$, the total hypernetwork parameter count satisfies
\[N_H = N_{H,0} + oc + N_\text{chunks} \cdot d_\text{emb},\]
which by the AM–GM inequality yields $N_H = \Omega(\sqrt{N})$. Changing the target model family typically requires redesigning and retraining the hypernetwork from scratch.
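The square-root bound follows in one step: substituting $N_\text{chunks} = N/c$ and applying AM–GM to the two $c$-dependent terms (treating the head width $o$ and embedding size $d_\text{emb}$ as constants),

```latex
N_H \;=\; N_{H,0} + oc + \frac{N}{c}\, d_\text{emb}
    \;\ge\; N_{H,0} + 2\sqrt{o\, d_\text{emb}\, N}
    \;=\; \Omega\!\left(\sqrt{N}\right),
```

with equality at $c = \sqrt{d_\text{emb} N / o}$ — so no choice of chunk size escapes the scaling.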
Additional specific limitations of prior methods:
- Ha et al. (2016) — generates only convolution kernels (not all parameters); embedding count grows with model size.
- Von Oswald et al. (2019) — chunked generation with learned per-chunk embeddings that scale with target size.
- Chen & Wang (2022); Navon et al. (2020) — architecture-coupled output parameterization forces redesign when the target model changes.
- Zhou et al. (2024); Kim et al. (2026) — output space sized for the largest architecture in the family, leading to wasted outputs or ad-hoc masking/truncation for smaller models.
- Hedlin et al. (2025) — learned per-block embedding sets whose number grows with generated blocks.
- Knyazev et al. (2021, 2023) — generated weights underperform direct training and require fine-tuning before deployment.
Recursive hypernetwork generation has been rarely explored and is challenging due to scaling and initialization instability (Liao et al., 2023; Lutati & Wolf, 2021).
Architecture
UHN is a fixed-architecture generator $H_\theta$ that predicts each scalar weight $w_i$ of a base model with $N$ parameters from deterministic descriptors:
\[w_i = H_\theta\!\left(\mathbf{v}_i,\; \mathbf{s}_g,\; \{\mathbf{s}_{\ell,j}\}_{j=1}^{L},\; \mathbf{t}\right)\]
Key principle: all target-specificity moves into the conditioning inputs, not the generator architecture.
Descriptors (4 types)
| Descriptor | Dim | Content |
|---|---|---|
| Index $\mathbf{v}_i \in \mathbb{R}^{10}$ | 10 | layer idx, layer type, param type, out idx, in idx, kernel h/w idx, embedding idx, sequence idx, grid idx |
| Global structure $\mathbf{s}_g \in \mathbb{R}^{6}$ | 6 | model type, num layers, cnn stage num, num encoders, num structure/index freqs (recursive) |
| Local structure $\mathbf{s}_{\ell,j} \in \mathbb{R}^{21}$ | 21 | per-layer: bias/norm/shortcut/activation type, I/O sizes, dropout, group num, kernel size, num heads, grid size, spline order, etc. |
| Task $\mathbf{t} \in \mathbb{R}^{2}$ | 2 | task type, dataset type |
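A minimal sketch of how an index descriptor might be assembled. The 10 fields follow the table above, but the field order and the integer codes (e.g. which `layer_type` value means "linear") are illustrative assumptions, not the paper's exact encoding:

```python
def index_descriptor(layer_idx, layer_type, param_type,
                     out_idx, in_idx, kh_idx=0, kw_idx=0,
                     emb_idx=0, seq_idx=0, grid_idx=0):
    """Return the 10-dim index descriptor v_i for one scalar weight.

    Fields unused by a layer type (e.g. kernel indices for a linear
    layer) default to 0 — an assumption about the padding convention.
    """
    return [layer_idx, layer_type, param_type,
            out_idx, in_idx, kh_idx, kw_idx,
            emb_idx, seq_idx, grid_idx]

# Example: weight [3, 7] of the second layer, assuming layer_type=0
# and param_type=0 encode "linear" and "weight".
v = index_descriptor(layer_idx=1, layer_type=0, param_type=0,
                     out_idx=3, in_idx=7)
assert len(v) == 10
```

One such vector is produced per scalar parameter of the target model, so descriptor construction is purely deterministic bookkeeping — nothing here is learned.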
Stage 1: Encode descriptors
Each descriptor vector $x$ is attribute-wise normalized (standardizing a uniform distribution) to $\hat{x}$, then mapped via Gaussian Fourier features (Tancik et al., 2020):
\[\gamma_B(\hat{x}) = \left[\cos(B\hat{x})^\top,\; \sin(B\hat{x})^\top\right]^\top \in \mathbb{R}^{2m},\]
where $B \in \mathbb{R}^{m \times n}$ has entries sampled i.i.d. from $\mathcal{N}(0, \sigma^2)$ and kept fixed. For index descriptors: $\phi_i = \gamma_{B_v}(\hat{\mathbf{v}}_i)$; for per-layer task-structure descriptors $\mathbf{u}_j = [\mathbf{s}_g; \mathbf{t}; \mathbf{s}_{\ell,j}]$: $\psi_j = \gamma_{B_u}(\hat{\mathbf{u}}_j)$.
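Stage 1 fits in a few lines. A minimal NumPy sketch, assuming $m = F_v/2$ so the output width equals $F_v$ (the relationship between $m$ and the tabulated $F_v$ is an assumption):

```python
import numpy as np

def gaussian_fourier_features(x_hat, B):
    """gamma_B(x_hat) = [cos(B x_hat); sin(B x_hat)].

    B has i.i.d. N(0, sigma^2) entries, sampled once and kept fixed —
    it is a frozen buffer, not a trained parameter.
    """
    proj = B @ x_hat                                     # (m,)
    return np.concatenate([np.cos(proj), np.sin(proj)])  # (2m,)

rng = np.random.default_rng(0)
m, n, sigma = 512, 10, 100.0
B_v = sigma * rng.standard_normal((m, n))  # fixed projection for index descriptors
v_hat = rng.uniform(-1.0, 1.0, size=n)     # attribute-wise normalized descriptor
phi = gaussian_fourier_features(v_hat, B_v)
assert phi.shape == (2 * m,)
```

The large default $\sigma = 100$ spreads nearby integer indices across many Fourier periods, which is what lets a smooth MLP emit sharply different weights for adjacent indices.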
Stage 2: Map encodings to weights
Index branch — MLP with pre-activation residual blocks (He et al., 2016): input linear (width $d$) → 2 residual blocks (each: ReLU → LayerNorm → Linear with width $d$) → shortcut connections.
Task-structure encoder (optional) — single-layer Transformer encoder (Vaswani et al., 2017) with $h$ heads applied to $\{\psi_j\}_{j=1}^{L}$ → mean pooling → 2-layer MLP (Linear → ReLU → Linear, width $d$; last linear zero-initialized).
Fusion — add task-structure feature to index-branch representation (after residual blocks) → ReLU → final linear layer → scalar weight $w_i$.
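The index branch and fusion can be sketched as follows — a shape-level NumPy sketch with untrained random weights, omitting the Transformer task-structure encoder (its pooled output is passed in as a precomputed vector); the initialization scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m_v = 64, 512  # width d and index-feature half-width (illustrative)

relu = lambda z: np.maximum(z, 0.0)
def layer_norm(z, eps=1e-5):
    return (z - z.mean()) / np.sqrt(z.var() + eps)

# Untrained parameters, for shape illustration only (biases omitted).
W_in   = rng.standard_normal((d, 2 * m_v)) / np.sqrt(2 * m_v)
blocks = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2)]
W_out  = rng.standard_normal((1, d)) / np.sqrt(d)

def uhn_forward(phi_i, task_struct_feat=None):
    """Index branch -> additive fusion -> ReLU -> final linear -> w_i."""
    h = W_in @ phi_i                     # input linear, width d
    for W in blocks:                     # pre-activation residual blocks:
        h = h + W @ layer_norm(relu(h))  #   ReLU -> LayerNorm -> Linear, + shortcut
    if task_struct_feat is not None:     # fusion after the residual blocks
        h = h + task_struct_feat
    return float(W_out @ relu(h))        # scalar weight w_i

w_i = uhn_forward(rng.standard_normal(2 * m_v))
assert np.isfinite(w_i)
```

Because the task-structure feature enters additively and its last linear layer is zero-initialized, the encoder contributes exactly nothing at step 0 and the index branch alone determines early training.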
Default architecture hyperparameters
| Setting | $F_v$ | $\sigma$ | $d$ | $N_\text{blk}$ | $F_u$ | $h$ |
|---|---|---|---|---|---|---|
| MNIST (single-model) | 1024 | 100 | 64 | 2 | — | — |
| All other single-model | 2048 | 100 | 128 | 2 | — | — |
| Multi-model / Multi-task | 2048 | 100 | 128 | 2 | 32 | 4 |
| Recursive (generated UHNs) | 1024 | 100 | 64 | — | 32 | 4 |
The default UHN has 612,117 trainable parameters (non-MNIST single-model) or 158,613 (MNIST). Crucially, this count is independent of the target model size — unlike embedding-based baselines where $N_H = \Omega(\sqrt{N})$.
Supported base model types
UHN generates all trainable parameters for: Linear, 2D Convolution, GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), Embedding, Multi-head Attention (MHA), and KAN (Liu et al., 2024) layers.
Generation modes
- Single-model, single-task: fix architecture/task descriptors, vary only index descriptors.
- Multi-model, single-task: fix task $\mathbf{t}$, vary architecture descriptors across models.
- Multi-task: vary both task $\mathbf{t}$ and architecture descriptors.
- Recursive: treat UHN itself as a target model; chain $H_0 \to H_1 \to \cdots \to H_K \to f$.
Training
Unified procedure
Each iteration: (1) sample target specification (architecture + task), (2) generate parameters via UHN, (3) evaluate task loss of resulting base model, (4) backpropagate through the entire differentiable generation path to update root parameters $\theta$.
Initialization phase (optional but critical)
Before main training, match generated parameter statistics to standard initializations. For each generated component $g$ with empirical mean $\mu(g)$ and std $\sigma(g)$:
\[\mathcal{L}_\text{init} = \frac{1}{2|\mathcal{G}|} \sum_{g \in \mathcal{G}} \left[\left(\mu(g) - \mu^*\!(g)\right)^2 + \left(\sigma(g) - \sigma^*\!(g)\right)^2\right]\]
Default targets $\mu^*(g)$, $\sigma^*(g)$ match PyTorch/PyG initializations (e.g., linear weights: $\mu^*=0$, $\sigma^*=\frac{1}{\sqrt{3\,d_\text{in}}}$; GCN: $\sigma^*=\sqrt{\frac{2}{d_\text{in}+d_\text{out}}}$; KAN spline weights: $\mathcal{N}(0, 0.1^2)$).
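The statistics-matching loss is straightforward to compute; a minimal NumPy sketch (the component names and the grouping of parameters into components are illustrative assumptions):

```python
import numpy as np

def init_loss(components, targets):
    """L_init = (1 / 2|G|) * sum_g [(mu(g) - mu*)^2 + (sigma(g) - sigma*)^2].

    components: dict name -> generated parameter array
    targets:    dict name -> (mu*, sigma*) target statistics
    """
    terms = [(g.mean() - targets[name][0]) ** 2
             + (g.std() - targets[name][1]) ** 2
             for name, g in components.items()]
    return 0.5 * float(np.mean(terms))

rng = np.random.default_rng(2)
d_in = 128
# Hypothetical generated linear weight vs. PyTorch-style target stats.
components = {"fc.weight": rng.uniform(-0.1, 0.1, size=(64, d_in))}
targets = {"fc.weight": (0.0, 1.0 / np.sqrt(3 * d_in))}
loss = init_loss(components, targets)
assert loss >= 0.0
```

Because only the first two moments are matched, this phase shapes the generated distribution without pinning individual weights, leaving the main task loss free to determine them.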
For recursive chains, initialization proceeds top-down: allocate $S_\text{lvl}$ steps per level, with active level $k = \min(K, \lfloor\text{step}/S_\text{lvl}\rfloor)$. Disabling initialization causes training to diverge in recursive settings.
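The top-down level schedule is a one-liner; a sketch of the stated rule:

```python
def active_level(step, S_lvl, K):
    """Top-down recursive init: level being matched at `step`.

    Levels 0..K are initialized in order, S_lvl steps each; after
    K * S_lvl steps the schedule stays on the deepest level K.
    """
    return min(K, step // S_lvl)

# K=2 levels below the root, S_lvl=100 steps per level:
assert [active_level(s, 100, 2) for s in (0, 99, 100, 250)] == [0, 0, 1, 2]
```

Initializing the root first makes sense: each level's target statistics are only meaningful once the generator above it already emits well-scaled weights.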
Optimization
- Optimizer: AdamW (no weight decay), cosine LR schedule with warmup.
- Hardware: single NVIDIA RTX 4090, PyTorch, AMP (FP16).
- Batch size: 256 for classification tasks; full-batch for formula regression.
- Warmup: 5 linear warmup epochs (single-/multi-model); 1000 warmup steps (multi-task/recursive).
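The schedule above can be sketched as a step-to-LR function — linear warmup into cosine decay to zero, which is a common variant; the exact decay floor and warmup interpolation used here are assumptions:

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup for `warmup_steps`, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear ramp-up
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine decay

# e.g. multi-task setting: 1000 warmup steps out of 200000, base lr 2e-5
assert lr_at(0, 200_000, 2e-5, 1000) < lr_at(999, 200_000, 2e-5, 1000)
```

In PyTorch this is typically passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.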
Hyperparameter selection
Staged grid sweep: (1) fix training budget, sweep $\eta_\text{train}$ with no init; (2) fix training params, sweep $(S_\text{init}, \eta_\text{init})$; (3) fix init params, refine training params. Multi-task selection uses Borda count with per-task validation guardrails.
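The Borda-count selection with guardrails might look like the following sketch. It assumes every task's validation metric is higher-is-better (for kv RMSE one would negate the score or flip the guardrail comparison), and the tie-breaking rule is an assumption:

```python
def borda_select(scores, guardrails):
    """Pick a hyperparameter config by Borda count across tasks.

    scores:     dict config -> dict task -> validation score (higher is better)
    guardrails: dict task -> minimum acceptable score; configs that
                violate any guardrail are excluded before ranking.
    """
    configs = [c for c, s in scores.items()
               if all(s[t] >= g for t, g in guardrails.items())]
    points = {c: 0 for c in configs}
    for t in guardrails:
        # per-task ranking: the best config receives the most Borda points
        for pts, c in enumerate(sorted(configs, key=lambda c: scores[c][t])):
            points[c] += pts
    return max(points, key=points.get)

scores = {
    "A": {"mnist": 0.97, "cifar": 0.88},
    "B": {"mnist": 0.98, "cifar": 0.84},  # fails the cifar guardrail
    "C": {"mnist": 0.96, "cifar": 0.89},
}
best = borda_select(scores, {"mnist": 0.95, "cifar": 0.85})
assert best in ("A", "C")  # "B" is eliminated by the guardrail
```

The guardrails keep a config that excels on the heavily sampled tasks from being chosen at the cost of collapsing a lightly sampled one.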
Evaluation
Single-model universality (Table 1)
One fixed UHN (612,117 params except MNIST at 158,613) vs. direct training across 11 task–model pairs:
| Task | Dataset | Model | Acc. (Direct) | Acc. (UHN) |
|---|---|---|---|---|
| Image | MNIST | MLP | 0.9837 ± 0.0007 | 0.9841 ± 0.0006 |
| Image | MNIST | CNN-8 | 0.9938 ± 0.0004 | 0.9944 ± 0.0010 |
| Image | CIFAR-10 | CNN-20 | 0.8999 ± 0.0027 | 0.8993 ± 0.0016 |
| Graph | Cora | GCN | 0.8172 ± 0.0039 | 0.7950 ± 0.0069 |
| Graph | PubMed | GCN | 0.7691 ± 0.0028 | 0.7815 ± 0.0075 |
| Graph | Cora | GAT | 0.8132 ± 0.0094 | 0.7981 ± 0.0091 |
| Text | AG News | Transformer-2L | 0.9186 ± 0.0008 | 0.9099 ± 0.0005 |
| Text | IMDB | Transformer-1L | 0.8853 ± 0.0007 | 0.8638 ± 0.0005 |
UHN matches or exceeds direct training on 4 of 11 settings. On formula regression (15 KAN special functions), UHN matches or improves RMSE on 9 of 15 functions.
Scalability (Table 2)
Fixed UHN (612,117 params) generating CNN-20/32/44/56 on CIFAR-10 vs. baselines:
| Model | #Params (Direct) | #Params (HA) | #Params (Chunked) | Acc. (Direct) | Acc. (UHN) |
|---|---|---|---|---|---|
| CNN-20 | 269K | 619K | 625K | 0.8999 | 0.8993 |
| CNN-56 | 851K | 668K | 632K | 0.9043 | 0.9069 |
UHN parameter count stays fixed at 612,117 while HA grows from 619K to 668K and Chunked from 625K to 632K.
Multi-model generalization (Table 3)
One UHN (663,151 params) trained on a model family, tested on held-out architectures:
| Family | #Models | Max Params | Seen Acc. | Unseen Acc. |
|---|---|---|---|---|
| CNN Mixed Depth | 100 | 463K | 0.8842 ± 0.0031 | 0.8430 ± 0.0023 |
| CNN Mixed Width | 500 | 1.07M | 0.9145 ± 0.0014 | 0.9145 ± 0.0013 |
| CNN Mixed Depth × Width | 1000 | 1.37M | 0.9038 ± 0.0012 | 0.9040 ± 0.0016 |
| Transformer Mixed | 1000 | 1.09M | 0.9063 ± 0.0004 | 0.9066 ± 0.0002 |
Three of four families show nearly identical seen/unseen accuracy. CNN Mixed Depth has a larger gap (driven by a single unseen outlier deeper than all training models).
Multi-task (Table 4)
One shared UHN across 6 heterogeneous tasks (vision, graph, text, formula regression):
| Task | Model | Perf. (Direct) | Perf. (UHN Single) | Perf. (UHN Multi) |
|---|---|---|---|---|
| MNIST | MLP | 0.9837 | 0.9841 | 0.9786 |
| CIFAR-10 | CNN-44 | 0.9076 | 0.9043 | 0.8927 |
| Cora | GCN | 0.8172 | 0.7950 | 0.7930 |
| PubMed | GAT | 0.7700 | 0.7801 | 0.7697 |
| AG News | Transformer-2L | 0.9186 | 0.9099 | 0.9062 |
| kv (RMSE) | KAN-g5 | 0.0211 | 0.0104 | 0.0172 |
Task sampling probabilities: CIFAR-10 (0.55), AG News (0.18), kv (0.11), MNIST (0.08), Cora (0.04), PubMed (0.04).
Recursive generation (Table 5)
$H_0 \to H_1 \to \cdots \to H_K \to f$ on MNIST MLP:
| Depth $K$ | Accuracy |
|---|---|
| 0 (no recursion) | 0.9841 ± 0.0006 |
| $K=1$ | 0.9825 ± 0.0007 |
| $K=2$ | 0.9795 ± 0.0011 |
| $K=3$ | 0.9741 ± 0.0021 |
Stable up to $K=3$ intermediate UHNs; the gradual degradation is consistent with compounding approximation error along the generation chain.
Ablation highlights
- Index encoding (Table 23): Raw (0.6642) → Positional (0.8677) → GFF (0.8993). Gaussian Fourier features are essential.
- Capacity (Table 24): Increasing $F_v$ (256→4096: 0.8894→0.9018) and $d$ (32→256: 0.8900→0.9019) improve accuracy; depth beyond 1 block yields marginal gains.
- Task-structure encoder (Tables 25–26): Primarily stabilizes early training; marginal/mixed effect on final accuracy.
- Initialization (Tables 27–28): Improves convergence speed and stability; essential for recursive training (diverges without it).
Reproduction Guide
Environment
# Hardware: single NVIDIA RTX 4090
# Framework: PyTorch with AMP (FP16)
pip install torch torchvision torch_geometric
git clone https://github.com/Xuanfeng-Zhou/UHN.git
cd UHN
Single-model: CIFAR-10 CNN-20
# Train direct baseline (for comparison)
python train_direct.py --dataset cifar10 --model cnn20 \
--lr 0.005 --epochs 400 --warmup_epochs 5 \
--seed 0
# Train UHN (best hyperparams from Table 13: no init, lr=2e-4, 800 epochs)
python train_uhn.py --dataset cifar10 --model cnn20 \
--Fv 2048 --d 128 --Nblk 2 --sigma 100 \
--init_steps 0 --init_lr 0 \
--train_lr 2e-4 --train_epochs 800 \
--warmup_epochs 5 --batch_size 256 \
--seed 0
Multi-model: CNN Mixed Width
# Best hyperparams (Table 19): init_lr=1e-4, Sinit=12800, train_lr=1e-4, Etrain=3200
python train_uhn_multi.py --family cnn_mixed_width \
--Fu 32 --heads 4 \
--init_lr 1e-4 --init_steps 12800 \
--train_lr 1e-4 --train_epochs 3200 \
--warmup_epochs 5 --batch_size 256 \
--seed 0
Multi-task (6 tasks)
# Best hyperparams (Table 21): init_lr=1e-4, Sinit=500, train_lr=2e-5, Strain=200000
python train_uhn_multitask.py \
--tasks mnist.mlp,cifar10.cnn44,cora.gcn,pubmed.gat,agnews.transformer2l,kv.kang5 \
--Fu 32 --heads 4 \
--init_lr 1e-4 --init_steps 500 \
--train_lr 2e-5 --train_steps 200000 \
--warmup_steps 1000 --batch_size 256 \
--seed 0
Recursive (depth K=1)
# Best hyperparams (Table 22): init_lr=1e-4, Sinit=4000, train_lr=2e-5, Strain=30000
python train_uhn_recursive.py --depth 1 \
--init_lr 1e-4 --init_steps 4000 \
--train_lr 2e-5 --train_steps 30000 \
--warmup_steps 1000 --grad_clip 0.01 \
--seed 0
Key hyperparameter sweep grids
- $\eta_\text{init} \in \{5\times 10^{-5},\ 10^{-4},\ 2\times 10^{-4}\}$, $S_\text{init} \in \{50, 100, 200\}$ (single-model)
- $\eta_\text{train} \in \{2\times 10^{-5},\ 5\times 10^{-5},\ 10^{-4}\}$ (most settings)
- Multi-task guardrails: MNIST ≥ 0.95, CIFAR-10 ≥ 0.85, Cora ≥ 0.75, PubMed ≥ 0.75, AG News ≥ 0.85, kv RMSE ≤ 5e-2
Notes
- Core insight: By modeling weights as a function of deterministic descriptors (index + architecture + task) rather than using learned per-layer embeddings, UHN achieves architecture-agnostic weight generation with a fixed generator parameter count independent of target model size.
- Descriptor design is manual: The 10/6/21/2-dimensional descriptor layouts are hand-crafted for the supported layer types (Linear, Conv, GCN, GAT, Embedding, MHA, KAN). Extending to new layer types requires designing new descriptor fields.
- No non-trainable state generation: UHN does not generate BatchNorm running statistics, limiting compatibility with some architectures that rely on batch normalization.
- Compute overhead: Training through the full differentiable generation path adds optimization cost and memory usage; UHN typically needs longer training schedules than direct training (e.g., 800 epochs vs. 400 for CIFAR-10 CNN-20).
- Task-structure encoder is auxiliary: Ablations show its effect on final accuracy is small and task-dependent; UHN’s core generality comes from the shared index-based generator.
- Initialization is essential for recursion: Without the statistics-matching init phase, recursive training diverges due to exploding activations and unstable gradients in the generation chain.
- Scaling to larger models (beyond ~1.37M target params) and deeper recursion ($K > 3$) are noted as open challenges requiring depth-specific optimization choices.
- Cross-task interference in multi-task training (e.g., kv RMSE degrading from 0.0104 single-task to 0.0172 multi-task) suggests gradient surgery (Yu et al., 2020) or task-adaptive sampling as promising mitigations.