2026-03-31
CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities
Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
problem
efficient vision backbone design is dominated by MACs (multiply-accumulate operations) as the primary efficiency metric. but MACs alone are a poor predictor of actual latency on CPUs: with limited parallelism (4-8 cores on a raspberry pi, 2-4 on an MCU), a large compute budget cannot be hidden behind massive parallel throughput. architectures optimized for GPUs and mobile NPUs (FasterNet, MobileNetV4, RepViT) often perform poorly on bare CPUs because their high MAC counts overwhelm the limited compute budget, even though they achieve high MACs-per-second (MACpS) on parallel hardware.
the core insight: on CPUs, latency $\propto \text{MACs} / \text{MACpS}$. you need both low total MACs AND high per-second throughput. depthwise convolutions (DWConv) achieve low MACs but terrible MACpS (poor hardware utilization). standard convolutions have high MACpS but too many MACs. the sweet spot: grouped convolutions and small kernels.
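the trade-off is easiest to see with numbers. a minimal sketch of the latency model, with made-up MACs and MACpS figures for a small ARM CPU (the specific values are illustrative, not measurements from the paper):

```python
# latency model from the notes: latency ≈ MACs / MACpS.
def latency_ms(macs: float, macps: float) -> float:
    """estimated latency in milliseconds for one inference."""
    return macs / macps * 1e3

# hypothetical per-op throughput (MACs/second) on a low-power CPU:
MACPS = {
    "dwconv":  0.5e9,  # few MACs per layer, but poor hardware utilization
    "conv":    4.0e9,  # great utilization, but each layer costs many MACs
    "grouped": 3.5e9,  # groups=2: near-conv throughput at roughly half the MACs
}

# the same capacity spent three ways (MAC budgets are illustrative):
print(latency_ms(100e6, MACPS["dwconv"]))   # → 200.0 ms
print(latency_ms(800e6, MACPS["conv"]))     # → 200.0 ms
print(latency_ms(400e6, MACPS["grouped"]))  # → ~114.3 ms, the sweet spot
```

the point: both extremes land at the same latency for opposite reasons; only the middle option wins on both factors at once.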
architecture
two new MBConv variants form the building blocks:
Grouped Fused MBConv (GrFuMBConv): sets groups=2 in the expansion convolution of a fused MBConv, with 2$\times$2 kernels. expansion factor = 4. structure: GroupedConv(groups=2, k=2$\times$2) $\to$ PWConv $\to$ (BN + GELU). achieves 45% fewer MACs than standard FuMBConv with essentially identical MACpS on CPU. the MAC reduction ratio is always exactly 0.55, independent of channel dimension.
Grouped MBConv (GrMBConv): sets groups=2 in the expansion convolution of a standard MBConv, with 2$\times$2 kernels. structure: GroupedConv(groups=2, k=2$\times$2) $\to$ DWConv(k=2$\times$2) $\to$ PWConv. achieves ~24% fewer MACs than MBConv, with only ~5% lower MACpS on CPU.
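a rough pytorch reconstruction of the two blocks from the descriptions above. the grouped/depthwise/pointwise ordering and hyperparameters follow the notes; the padding choice for the even 2×2 kernel, the norm/activation placement, and the residual connection are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class GrFuMBConv(nn.Module):
    """grouped fused MBConv: GroupedConv(g=2, k=2x2) -> PWConv, with BN+GELU."""
    def __init__(self, c: int, expand: int = 4):
        super().__init__()
        h = c * expand
        # grouped expansion: groups=2 halves the MACs of this conv
        self.expand = nn.Conv2d(c, h, 2, padding="same", groups=2, bias=False)
        self.bn1, self.act = nn.BatchNorm2d(h), nn.GELU()
        self.project = nn.Conv2d(h, c, 1, bias=False)  # pointwise projection
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        y = self.act(self.bn1(self.expand(x)))
        return self.bn2(self.project(y)) + x  # residual (assumed)

class GrMBConv(nn.Module):
    """grouped MBConv: GroupedConv(g=2, k=2x2) -> DWConv(k=2x2) -> PWConv."""
    def __init__(self, c: int, expand: int = 4):
        super().__init__()
        h = c * expand
        self.expand = nn.Conv2d(c, h, 2, padding="same", groups=2, bias=False)
        self.bn1 = nn.BatchNorm2d(h)
        self.dw = nn.Conv2d(h, h, 2, padding="same", groups=h, bias=False)
        self.bn2, self.act = nn.BatchNorm2d(h), nn.GELU()
        self.project = nn.Conv2d(h, c, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(c)

    def forward(self, x):
        y = self.act(self.bn1(self.expand(x)))
        y = self.act(self.bn2(self.dw(y)))
        return self.bn3(self.project(y)) + x  # residual (assumed)
```

`padding="same"` handles the asymmetric padding an even kernel needs at stride 1; the repo may resolve this differently.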
design rules discovered through systematic measurement:
| channel range | preferred variant | why |
|---|---|---|
| < 256 | fused (GrFuMBConv) | ~70% higher MACpS than unfused |
| >= 256 | unfused (GrMBConv) | ~27% higher MACpS, better FLOP utilization |
2$\times$2 kernels give ~42% higher MACpS for DWConv, ~5% for FuMBConv, ~12% for GrFuMBConv on ARM CPUs, because smaller kernels fit better in L1/L2 cache and reduce memory bandwidth pressure.
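a back-of-envelope MAC counter makes the grouped-expansion savings concrete (the formula is the standard conv MAC count; the channel/resolution numbers are just an example):

```python
# MACs of a conv layer: each output pixel costs (cin/groups) * k*k MACs
# per output channel.
def conv_macs(cin: int, cout: int, k: int, hw: int, groups: int = 1) -> int:
    return hw * cout * (cin // groups) * k * k

hw, c = 28 * 28, 128
full    = conv_macs(c, 4 * c, k=2, hw=hw)            # ungrouped 2x2 expansion
grouped = conv_macs(c, 4 * c, k=2, hw=hw, groups=2)  # groups=2 expansion
print(grouped / full)  # → 0.5: grouping halves the expansion conv's MACs
```

the paper's block-level ratio (0.55 for GrFuMBConv) sits above 0.5 because the block also contains the ungrouped pointwise projection, whose MACs are unchanged.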
CPUBone macro design:
```mermaid
flowchart LR
  S[stem: conv3x3<br/>stride 2] --> S1["stage 1<br/>GrFuMBConv<br/>low channels"]
  S1 --> S2["stage 2<br/>GrFuMBConv<br/>medium channels"]
  S2 --> S3["stage 3<br/>GrMBConv<br/>high channels"]
  S3 --> S4["stage 4<br/>GrMBConv + attention<br/>highest channels"]
```
- 4 stages with stride-2 down-sampling at boundaries
- early stages use fused variants (low channels, benefit from MACpS)
- later stages use unfused variants (high channels, better FLOP utilization)
- LowFormer attention in the final stage with nearest-neighbor upsampling (not transpose conv – ~5% higher MACpS, significant latency improvement)
- stem: Conv2D 3$\times$3, stride 2
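the macro layout can be written down as a config sketch. only the stage-to-variant mapping and the <256 / >=256 fused/unfused rule come from the notes; the channel widths are placeholders, not the paper's:

```python
# stage plan for the CPUBone-style macro design (channels are placeholders).
STAGES = [
    {"name": "stage1", "block": "GrFuMBConv", "channels": 64},
    {"name": "stage2", "block": "GrFuMBConv", "channels": 128},
    {"name": "stage3", "block": "GrMBConv",   "channels": 256},
    {"name": "stage4", "block": "GrMBConv",   "channels": 512,
     "attention": "LowFormer"},
]

def pick_variant(channels: int) -> str:
    """design rule from the table: fused below 256 channels, unfused at/above."""
    return "GrFuMBConv" if channels < 256 else "GrMBConv"

# sanity check: the stage plan obeys the channel-threshold rule
assert all(pick_variant(s["channels"]) == s["block"] for s in STAGES)
```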
four model sizes:
| model | params | MACs | top-1 |
|---|---|---|---|
| B0 | 5.4M | 562M | 78.7% |
| B1 | 12.4M | 746M | 78.7% |
| B2 | 24.8M | 1527M | 82.8% |
| B3 | 44.9M | 2977M | 84.1% |
training
- hardware: LEONARDO supercomputer (EuroHPC), GPU cluster (likely A100)
- optimizer: AdamW, weight decay 0.05
- learning rate: cosine annealing with 5-epoch warmup
- regularization: AutoAugment, RandAugment (B2/B3), Mixup, CutMix, DropPath, BN momentum decay
- training epochs: 600
- resolution: 224$\times$224
- channel-last memory format
- B0/B1: batch 512, lr $10^{-3}$. B2: batch 1024, lr $10^{-3}$. B3: batch 2400, lr $3 \times 10^{-3}$
- downstream: RetinaNet on COCO (12 epochs, 1x schedule), Semantic FPN on ADE20K (40k iterations, batch 32)
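the optimizer/schedule recipe above, sketched with plain pytorch (`SequentialLR` chaining a linear warmup into cosine annealing; the tiny linear model is a stand-in for the backbone, and the warmup start factor is my choice):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the CPUBone backbone
epochs, warmup = 600, 5
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    schedulers=[
        # 5-epoch linear warmup, then cosine decay over the remaining epochs
        torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=warmup),
        torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs - warmup),
    ],
    milestones=[warmup],
)
for _ in range(epochs):  # one scheduler step per epoch
    opt.step()
    sched.step()
print(sched.get_last_lr())  # lr has annealed to ~0 by the final epoch
```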
evaluation
ImageNet classification (CPU latency):
| backbone | params | MACs | top-1 | pi 5 CPU (ms) | pixel 7 CPU (ms) | intel CPU (ms) |
|---|---|---|---|---|---|---|
| CPUBone-B0 | 5.4M | 562M | 78.7 | 42.3 | 10.5 | 4.7 |
| FasterNet-T0 | 2.2M | 376M | 77.2 | ~45 | – | – |
| MobileNetV3-Large | 5.4M | 219M | 75.2 | 158 | – | – |
| MobileNetV4-Conv-M | 4.3M | 507M | 78.9 | – | – | – |
| CPUBone-B3 | 44.9M | 2977M | 84.1 | 157.1 | 44.1 | 18.6 |
| RepViT-M2.3 | 52M | 4031M | 83.5 | considerably slower | – | – |
CPUBone-B0 runs at 42.3ms on raspberry pi 5 CPU – 3.7x faster than MobileNetV3-Large at higher accuracy. B0 vs FasterNet-T0: 1.5% higher accuracy at similar latency despite 1.5x more MACs, because grouped convolutions maintain higher MACpS.
downstream tasks (pi 5 CPU):
| backbone | COCO AP | ADE20K mIoU | latency (ms) |
|---|---|---|---|
| CPUBone-B0 | 37.5 | 37.9 | 131.5 |
| CPUBone-B2 | 40.4 | 42.1 | 338.2 |
| comparable models | ~40.0 | ~41.0 | ~1000+ |
up to 3x faster execution than comparable models at similar or higher detection/segmentation quality.
where it loses: on GPU throughput, the grouped convolutions and 2$\times$2 kernels hurt performance. CPUBone is explicitly not designed for GPU use – it trades GPU efficiency for CPU efficiency.
reproduction guide
```shell
git clone https://github.com/altair199797/CPUBone.git
cd CPUBone
pip install torch torchvision timm mmcv mmdet mmseg

# training (8 GPUs, ImageNet)
python -m torch.distributed.launch --nproc_per_node=8 train.py \
    --model cpubone_b0 --data /path/to/imagenet \
    --batch-size 512 --lr 1e-3 --epochs 600 \
    --opt adamw --weight-decay 0.05 --sched cosine --warmup-epochs 5
```
gotchas:
- latency measurement must use batch_size=1. multi-batch gives misleading results on CPUs
- use ONNX/TFLite export for accurate CPU benchmarking – PyTorch eager mode adds overhead
- nearest-neighbor upsampling in LowFormer attention is critical (33.5ms vs 39.7ms on pi 5 for B1 with transpose conv)
- groups=2 is optimal. groups=4 degrades accuracy by 0.3%, groups=8 by 0.7%, with minimal latency improvement
- channel-last memory format is important for training throughput
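a batch-1 CPU latency harness following the gotchas above. this is an eager-mode sketch with warmup iterations; for publication-grade numbers the notes recommend ONNX/TFLite export instead, since eager pytorch adds overhead:

```python
import time
import torch

@torch.inference_mode()
def cpu_latency_ms(model, input_size=(1, 3, 224, 224), warmup=10, runs=50):
    """mean per-inference latency in ms on CPU, batch_size=1."""
    model.eval().cpu()
    x = torch.randn(*input_size)  # batch_size=1: multi-batch misleads on CPUs
    for _ in range(warmup):       # warm caches/allocator before timing
        model(x)
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - t0) / runs * 1e3

# usage with any backbone; a small conv as a stand-in here:
net = torch.nn.Conv2d(3, 16, 3, stride=2, padding=1)
print(f"{cpu_latency_ms(net):.2f} ms")
```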
compute: B0/B1 training needs 8+ GPUs at 512 batch size for 600 epochs. inference runs on bare CPU – no GPU needed at deployment.
notes
CPUBone is directly relevant to bopi’s embedded/hardware interest. the 5.4M parameter B0 model achieving 78.7% top-1 at 42.3ms on a raspberry pi 5 CPU is genuinely deployable on edge hardware. the design principles (grouped convolutions, small kernels, fused variants for low-channel stages) are simple enough to apply to custom architectures for robotics perception on MCUs.
the key lesson for embedded AI practitioners: stop optimizing for MACs alone. on CPUs, the metric that matters is MACs / MACpS. DWConv has low MACs but terrible MACpS, making it deceptively inefficient. grouped convolutions (groups=2) give the best trade-off: they halve the MACs of the expansion conv while maintaining most of the MACpS of a full convolution.
important caveat: all benchmarks are on ARM application processors (pi 5, pixel 7 pro, snapdragon) and intel xeon. no results on actual MCUs (STM32, ESP32). the principles should transfer but need verification at the lower end.