2026-03-29

Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML

Yassien Shaalan

TinyML compression embedded MCU edge

problem

Deploying neural networks on microcontrollers (MCUs) is fundamentally constrained by kilobytes of flash and SRAM. Even after INT8 quantization, 1×1 pointwise (PW) convolutions — which serve as channel mixers in depthwise-separable CNNs — dominate the stored model footprint. In a typical TinyML CNN for ECG or speech classification, PW layers account for the vast majority of parameter bytes.

Prior compression approaches fall short for MCU deployment:

  • Structured pruning (e.g., ThiNet, AMC) removes entire filters or channels but still requires storing the remaining weights in full, and accuracy drops sharply under aggressive sparsity targets on kilobyte budgets.
  • Knowledge distillation (TinyBERT, MobileNetV3-style) requires a pretrained teacher and extensive retraining; the compressed student still stores all weights explicitly.
  • Weight sharing / hashing (e.g., HashedNets, Deep Compression's codebook quantization) reduces the number of unique stored values but incurs lookup overhead and index storage that negates the savings at the extreme compression ratios MCUs demand.
  • Standard hypernetworks generate weights from a shared network but the hypernetwork itself is too large for MCU flash; they were designed for GPU inference, not kilobyte-constrained embedded targets.

The core challenge: MCUs like STM32H7 have 256 KB flash and 64 KB SRAM. A TinyML CNN with INT8 weights can reach ~1.4 MB — well over 5× the flash budget. The gap between what standard compression can achieve and what MCUs require motivates a fundamentally different approach.

architecture

flowchart LR
    inp[input features x_l] --> code[layer code z_l]
    code --> H[shared hypernetwork H]
    H --> gen[generated weights W_hat_l]
    gen --> conv[1x1 conv layer]
    inp2[input features] --> conv
    conv --> out[output features]
    
    style code fill:#c4b8a6,color:#fff
    style H fill:#c4b8a6,color:#fff
    style gen fill:#b09a84,color:#fff

HyperTinyPW replaces most stored PW weights with weights generated at load time by a shared micro-hypernetwork. The key insight: PW layers across a CNN share significant structural redundancy. A single lightweight hypernetwork, conditioned on tiny per-layer latent codes, can synthesize each layer’s channel-mixing kernel with high fidelity.

base model: depthwise-separable cnn

The backbone is a depthwise-separable convolutional network. For layer $l$ with input channels $C_{\text{in}}^{(l)}$ and output channels $C_{\text{out}}^{(l)}$:

\[\mathbf{Y}^{(l)} = \text{PW}^{(l)}\left(\text{DW}^{(l)}(\mathbf{X}^{(l)})\right)\]

where $\text{DW}^{(l)}$ is a depthwise 3×3 or 1×1 convolution and $\text{PW}^{(l)}$ is a 1×1 pointwise convolution with kernel $\mathbf{W}^{(l)} \in \mathbb{R}^{C_{\text{out}}^{(l)} \times C_{\text{in}}^{(l)}}$.

The PW kernel dominates memory. For a layer mapping 128 → 256 channels: $|\mathbf{W}^{(l)}| = 256 \times 128 = 32{,}768$ INT8 values = 32 KB per layer. With 5–8 PW layers of this scale, the PW weights alone approach or exceed the entire 256 KB flash budget.

weight generation via shared hypernetwork

Instead of storing $\mathbf{W}^{(l)}$, HyperTinyPW generates it:

\[\hat{\mathbf{W}}^{(l)} = \mathcal{H}(\mathbf{z}^{(l)}; \theta)\]

where:

  • $\mathcal{H}$ is the shared micro-hypernetwork (a small MLP) with parameters $\theta$
  • $\mathbf{z}^{(l)} \in \mathbb{R}^{d_z}$ is the per-layer latent code
  • $d_z \ll C_{\text{out}}^{(l)} \times C_{\text{in}}^{(l)}$, typically a few bytes per layer

The hypernetwork $\mathcal{H}$ is shared across all PW layers (the “once-for-all” property). Only one copy of $\theta$ is stored, plus one $\mathbf{z}^{(l)}$ per layer. The first PW layer is kept as stored INT8 weights (not generated) for training stability and to anchor the feature representation.

compression ratio

The compression ratio is:

\[\text{CR} = \frac{\displaystyle\sum_{l=1}^{L} |\mathbf{W}^{(l)}|}{|\theta| + |\mathbf{W}^{(1)}| + \displaystyle\sum_{l=2}^{L} |\mathbf{z}^{(l)}|}\]

The denominator includes the hypernetwork parameters, the first PW layer (stored), and all per-layer codes. With a micro-MLP of a few hundred parameters and codes of dimension $d_z \approx 8$–$16$, the overhead is negligible compared to the PW weight savings.
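As a sanity check, the packed-byte arithmetic can be worked through for a hypothetical stack of PW layers (the layer shapes, code size, and hypernetwork size below are illustrative assumptions, not the paper's exact configuration):

```python
# Hypothetical PW layer shapes (C_out, C_in); not the paper's exact architecture.
pw_shapes = [(64, 64), (128, 64), (128, 128), (256, 128), (256, 256)]

dense_bytes = sum(co * ci for co, ci in pw_shapes)   # 1 byte per INT8 weight

d_z = 16              # per-layer code dimension (1 byte per element assumed)
theta_bytes = 600     # assumed micro-MLP parameter count
stored_bytes = (
    theta_bytes
    + pw_shapes[0][0] * pw_shapes[0][1]              # W^(1) kept as stored INT8
    + d_z * (len(pw_shapes) - 1)                     # one z^(l) per generated layer
)
cr = dense_bytes / stored_bytes
print(f"dense: {dense_bytes} B, stored: {stored_bytes} B, CR = {cr:.1f}x")
```

Even under these conservative assumptions the hypernetwork and codes are a rounding error next to the dense PW weights they replace.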

load-time generation and caching

Weight generation is a one-time operation at model load time:

  1. For each compressed layer $l \in \{2, \ldots, L\}$, feed $\mathbf{z}^{(l)}$ into $\mathcal{H}$ to produce $\hat{\mathbf{W}}^{(l)}$
  2. Quantize generated weights to INT8
  3. Cache all generated weights in SRAM alongside $\mathbf{W}^{(1)}$
  4. Run standard INT8 inference with the cached weight set

Steady-state inference latency matches the INT8 baseline exactly — the generation cost is amortized entirely at load time.
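The load-time pass above can be sketched as follows; `hypernet` is any callable mapping a code to a flat weight vector, and the symmetric per-tensor INT8 scheme is an assumption (the quantizer details are not specified here):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization; returns (int8 weights, scale)."""
    scale = max(np.abs(w).max(), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def generate_and_cache(hypernet, codes, shapes):
    """One-time load-time pass: generate W_hat^(l) for l = 2..L, quantize,
    and cache in SRAM-resident buffers for steady-state INT8 inference."""
    cache = {}
    for l, (z, (c_out, c_in)) in enumerate(zip(codes, shapes), start=2):
        w = hypernet(z).reshape(c_out, c_in)   # generated float kernel
        cache[l] = quantize_int8(w)
    return cache
```

Dequantizing a cached layer as `q.astype(np.float32) * scale` recovers the generated kernel within quantization error.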

packed-byte memory accounting

The paper uses TinyML-faithful memory accounting: every parameter, activation buffer, and lookup table is counted in packed bytes as it would appear in MCU flash/SRAM. No half-precision or floating-point fudge factors.

training

Training proceeds in phases:

phase 1: full model pretraining

Train the depthwise-separable CNN to convergence on the target task using standard cross-entropy loss:

\[\mathcal{L}_{\text{task}} = -\sum_{i=1}^{N} y_i \log \hat{y}_i\]

This establishes the baseline accuracy and provides the target weights $\mathbf{W}^{(l)}$ that the hypernetwork must reconstruct.

phase 2: hypernetwork training with reconstruction

Learn the shared hypernetwork parameters $\theta$ and per-layer codes $\mathbf{z}^{(l)}$ by minimizing a combined task + reconstruction loss:

\[\mathcal{L} = \mathcal{L}_{\text{task}}(\hat{\mathbf{Y}}, \mathbf{Y}) + \lambda \sum_{l=2}^{L} \left\| \mathbf{W}^{(l)} - \hat{\mathbf{W}}^{(l)} \right\|_F^2\]

where $\hat{\mathbf{W}}^{(l)} = \mathcal{H}(\mathbf{z}^{(l)}; \theta)$ and $\lambda$ controls the reconstruction fidelity penalty. The Frobenius norm on weight matrices directly penalizes deviation from the pretrained weights.

The first PW layer weights $\mathbf{W}^{(1)}$ remain frozen (stored as INT8, not generated).
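A minimal PyTorch rendering of the phase-2 objective (the λ default and the shapes used here are assumptions):

```python
import torch
import torch.nn.functional as F

def phase2_loss(logits, targets, pretrained_ws, generated_ws, lam=0.5):
    """L_task + lambda * sum_l ||W^(l) - W_hat^(l)||_F^2 over generated layers."""
    task = F.cross_entropy(logits, targets)
    recon = sum(
        torch.sum((w - w_hat) ** 2)          # squared Frobenius norm per layer
        for w, w_hat in zip(pretrained_ws, generated_ws)
    )
    return task + lam * recon
```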

phase 3: end-to-end fine-tuning

After the hypernetwork converges, perform end-to-end fine-tuning with the generated weights in the loop. This recovers task accuracy that may have been lost during compression. The latent codes $\mathbf{z}^{(l)}$ and hypernetwork parameters $\theta$ are updated jointly.

practical considerations

  • The hypernetwork is deliberately kept tiny (micro-MLP with a few hundred parameters) so it does not itself become a storage bottleneck.
  • Per-layer codes are of dimension $d_z$ in the range of 8–16, making each code 8–16 bytes — negligible compared to the thousands of bytes of PW weights being replaced.
  • Quantization-aware training can be applied in phase 3 to account for the INT8 quantization of generated weights on the target MCU.

evaluation

ecg classification

Evaluated on standard clinical ECG benchmarks:

| Dataset | Task | Baseline (INT8) | HyperTinyPW | Compression |
| --- | --- | --- | --- | --- |
| Apnea-ECG | Apnea detection | ~1.4 MB model | 225 KB | 6.31× |
| PTB-XL | Multi-label ECG | | | |
| MIT-BIH | Arrhythmia | | | |

At the 225 KB budget, HyperTinyPW achieves a 6.31× compression ratio (an 84.15% reduction in stored bytes relative to the full 1.4 MB INT8 CNN) while retaining approximately 95% of the uncompressed model's macro-F1.

Under extreme budgets of 32–64 KB, HyperTinyPW sustains balanced detection performance across classes, while baselines (structured pruning, simple quantization) degrade catastrophically on minority classes.

speech commands

Transferred to audio TinyML on Google Speech Commands (12-class keyword spotting):

  • Achieves 96.2% accuracy on Speech Commands
  • Demonstrates that the compression-as-generation approach generalizes beyond ECG/time-series to audio spectrogram features

mcu deployment metrics

Target hardware class: STM32H7 or equivalent MCUs with:

  • 256 KB flash
  • 64 KB SRAM
  • ARM Cortex-M7 core
  • No hardware floating-point unit required (INT8 inference path)

Key property: since weight generation is a one-time load-time operation, steady-state inference latency is identical to the INT8 baseline. The generation cost is paid once when the model is loaded, after which standard integer operators execute the cached weights. This makes HyperTinyPW practical for real-time sensing applications where latency budgets are tight.

reproduction guide

prerequisites

  • Python 3.9+ with PyTorch
  • PhysioNet datasets (Apnea-ECG, PTB-XL, MIT-BIH) from physionet.org
  • Google Speech Commands from kaggle/google-speech-commands
  • No public code repo is listed — implementation must be built from the paper’s description

step 1: data setup

# ECG datasets from PhysioNet (open access; no credentials required)
wget -r -N -c -np https://physionet.org/files/apnea-ecg/
wget -r -N -c -np https://physionet.org/files/ptb-xl/
wget -r -N -c -np https://physionet.org/files/mitdb/

Preprocess ECG signals: resample to 250 Hz, segment into fixed-length windows (e.g., 5–10 seconds), normalize per-lead.

For Speech Commands: download, extract 1-second audio clips, compute log-mel spectrograms (typically 40–64 mel bins, 16 ms frame hop).
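A sketch of the ECG side of that pipeline; linear-interpolation resampling and per-lead z-normalization are assumptions about details the summary leaves open:

```python
import numpy as np

def preprocess_ecg(signal, fs, target_fs=250, window_s=10):
    """signal: (n_samples, n_leads). Resample to target_fs by linear
    interpolation, segment into fixed windows, z-normalize each lead."""
    n_out = int(signal.shape[0] * target_fs / fs)
    t_old = np.linspace(0.0, 1.0, signal.shape[0])
    t_new = np.linspace(0.0, 1.0, n_out)
    resampled = np.stack(
        [np.interp(t_new, t_old, signal[:, ch]) for ch in range(signal.shape[1])],
        axis=1,
    )
    win = target_fs * window_s
    n_win = resampled.shape[0] // win
    windows = resampled[: n_win * win].reshape(n_win, win, -1)
    mu = windows.mean(axis=1, keepdims=True)
    sd = windows.std(axis=1, keepdims=True) + 1e-8
    return (windows - mu) / sd   # (n_win, win, n_leads)
```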

step 2: train baseline depthwise-separable cnn

  1. Define a depthwise-separable CNN appropriate for input dimensions (1D for ECG, 2D for spectrograms)
  2. Train with Adam optimizer, learning rate $10^{-3}$, batch size 64–128
  3. Apply INT8 quantization (PyTorch quantization-aware training or post-training static quantization)
  4. Record baseline macro-F1 and model size in packed INT8 bytes
  5. Expected baseline: ~1.4 MB for a 6–8 layer network on ECG tasks

step 3: implement the micro-hypernetwork

  1. Freeze $\mathbf{W}^{(1)}$ (first PW layer)
  2. Define $\mathcal{H}$: a small MLP (e.g., Linear($d_z$, 64) → ReLU → Linear(64, $C_{\text{out}}^{(l)} \times C_{\text{in}}^{(l)}$))
  3. Initialize per-layer codes $\mathbf{z}^{(l)} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ for $l \in \{2, \ldots, L\}$
  4. Use a separate output head per layer (or reshape a shared output projection) to handle varying $C_{\text{out}}^{(l)} \times C_{\text{in}}^{(l)}$ dimensions
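Steps 2–4 above might look like the following PyTorch sketch (class name, trunk width, and initialization scale are assumptions):

```python
import torch
import torch.nn as nn

class MicroHyperNet(nn.Module):
    """Shared trunk + one output head per compressed PW layer (l = 2..L)."""

    def __init__(self, layer_shapes, d_z=16, hidden=64, sigma=0.1):
        super().__init__()
        self.shapes = layer_shapes                      # [(C_out, C_in), ...]
        self.trunk = nn.Sequential(nn.Linear(d_z, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, co * ci) for co, ci in layer_shapes]
        )
        # Per-layer latent codes z^(l) ~ N(0, sigma^2 I), trained jointly
        self.codes = nn.ParameterList(
            [nn.Parameter(sigma * torch.randn(d_z)) for _ in layer_shapes]
        )

    def forward(self, layer_idx):
        co, ci = self.shapes[layer_idx]
        h = self.trunk(self.codes[layer_idx])
        return self.heads[layer_idx](h).view(co, ci)    # W_hat^(l)
```

The per-layer heads handle the varying output dimensions; a shared output projection with reshaping would trade a few parameters for simpler code.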

step 4: hypernetwork training

  1. Initialize $\theta$ and $\mathbf{z}^{(l)}$ to match pretrained weights via SVD-based initialization (optional but helpful)
  2. Train with $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_{l=2}^{L} \| \mathbf{W}^{(l)} - \hat{\mathbf{W}}^{(l)} \|_F^2$, $\lambda \approx 0.1$–$1.0$
  3. Optimizer: Adam with learning rate $10^{-4}$ for $\theta$, $10^{-3}$ for $\mathbf{z}^{(l)}$
  4. Monitor both task accuracy and weight reconstruction error
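The split learning rates in step 3 map directly onto Adam parameter groups; the stand-in `theta` MLP and code shapes here are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: theta = shared micro-MLP, codes = per-layer latents.
theta = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32 * 16))
codes = nn.ParameterList([nn.Parameter(torch.randn(16)) for _ in range(4)])

opt = torch.optim.Adam([
    {"params": theta.parameters(), "lr": 1e-4},   # hypernetwork parameters
    {"params": codes.parameters(), "lr": 1e-3},   # per-layer codes
])

# One reconstruction step toward a frozen pretrained kernel W^(2)
target = torch.randn(32, 16)
w_hat = theta(codes[0]).view(32, 16)
loss = torch.sum((w_hat - target) ** 2)           # Frobenius term; add L_task in practice
opt.zero_grad()
loss.backward()
opt.step()
```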

step 5: mcu deployment simulation

  1. Count packed bytes: $|\theta| + |\mathbf{W}^{(1)}| + \sum_{l=2}^{L} |\mathbf{z}^{(l)}|$ in INT8-equivalent storage
  2. Simulate load-time generation: run $\mathcal{H}$ for each layer, quantize outputs to INT8
  3. Verify inference accuracy matches on-device expectations
  4. Target: 225 KB total footprint with 6.31× compression

expected results

  • Compression ratio: 6.31× at 225 KB (relative to the 1.4 MB INT8 baseline)
  • Macro-F1 retention: ~95% of uncompressed model
  • Speech Commands accuracy: 96.2%
  • Steady-state latency: identical to INT8 baseline

gotchas

  • The first PW layer must be kept stored, not generated — removing it causes training instability and significant accuracy degradation.
  • Code dimension $d_z$ is critical: too small and reconstruction quality suffers; too large and compression ratio decreases. Start with $d_z = 16$.
  • Weight reconstruction loss $\lambda$ needs careful tuning — too high and the hypernetwork overfits to pretrained weights without adapting to the task; too low and generated weights diverge.
  • No public code is available — all implementation must be done from scratch based on the paper.

notes

Compression-as-generation is practical, not just theoretical. The key result is that 225 KB of stored data can replace 1.4 MB of INT8 weights while retaining 95% accuracy. This is a fundamental shift from “compress weights” to “don’t store weights at all.” The steady-state inference matching INT8 baseline latency makes this deployable today on STM32-class hardware.

Cross-layer redundancy is the low-hanging fruit. The shared hypernetwork explicitly exploits that PW kernels across layers have similar structure. This is complementary to other compression techniques — you could combine HyperTinyPW with structured pruning or activation quantization for even more aggressive compression.

The load-time generation assumption is reasonable but not free. Generating weights at load time adds startup latency proportional to the hypernetwork forward pass for each compressed layer. For a 6–8 layer model this is a few milliseconds — acceptable for always-on sensing but potentially problematic for hard real-time startup requirements.

Applicability beyond PW layers. The natural question is whether this extends to attention layers (Q, K, V projection matrices in transformers) or even depthwise convolutions. Attention projections have similar structure to PW convolutions (linear channel mixing), making them a natural next target. This could make LLMs on tiny devices more feasible by aggressively compressing the non-attention layers first, then the attention projections.

Weaknesses. No public code makes independent verification difficult. The evaluation is limited to ECG and speech — computer vision tasks (CIFAR, ImageNet subsets on MCUs) would strengthen the generality claim. The paper doesn’t report load-time latency numbers explicitly, which matters for deployment.

Connection to bopi research. Directly relevant to embedded AI on kilobyte-constrained devices. The 6.31× compression ratio at 95% accuracy retention is a strong result that changes what’s feasible on a 256 KB flash MCU. Combined with quantization and pruning, this could enable CNN architectures that were previously impossible on bare-metal microcontrollers.