2026-04-02
PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Caio Vicentino
Problem
Large language models at FP16 require ~18 GB for a 9B-parameter model, exceeding consumer GPU memory. Quantization to 4 bits reduces this to ~5–6 GB, but naive methods sacrifice significant quality. The core problem is that widely used absmax quantization assumes a uniform distribution over $[-\alpha, \alpha]$ (where $\alpha = \max_i |w_i|$), which poorly matches the empirically observed near-Gaussian weight distributions of LLMs. This wastes codebook entries on rarely occurring outlier magnitudes and concentrates quantization error in the high-density central region.
Prior art and their limitations:
- Absmax quantization [Jacob et al., 2018]: Computationally trivial, but provably suboptimal for non-uniform distributions. Places quantization levels uniformly, wasting resolution in the Gaussian tails.
- GPTQ [Frantar et al., ICLR 2023]: Layer-wise quantization using approximate Hessian information via the optimal brain surgeon framework. Achieves strong results but requires calibration data and is computationally expensive for large models.
- AWQ [Lin et al., MLSys 2024]: Activation-aware per-channel scaling to protect important channels. Requires calibration data. Operates on channels (inter-block) rather than within weight blocks.
- NF4 (NormalFloat) [Dettmers et al., NeurIPS 2023]: Designs codebooks optimal for normally distributed weights by spacing levels uniformly in the quantile domain. Assumes Gaussianity a priori rather than explicitly transforming to it. Information-theoretically optimal for equal-probability bins but does not minimize MSE.
- QuIP/QuIP# [Chee et al., NeurIPS 2023 / ICML 2024]: Applies random incoherence processing (randomized Hadamard transforms + lattice codebooks) for 2-bit quantization. Operates on entire weight matrix columns (inter-block) and bounds worst-case error. Computationally heavier and less composable.
- QuaRot [Ashkboos et al., NeurIPS 2024]: Hadamard rotations on hidden states, activations, and KV cache to remove outliers. Requires graph surgery to absorb rotations into adjacent layers (inter-layer), modifying the model graph.
- SpinQuant [Liu et al., ICLR 2025]: Learned rotation matrices outperforming fixed Hadamard rotations by up to 16 points on zero-shot tasks. Also operates between layers requiring graph modification.
- TurboQuant [Ashkboos et al., 2025]: Applies polar quantization to KV cache compression during inference, proving information-theoretic lower bounds and achieving near-optimal distortion rates. PolarQuant adapts this framework from KV cache to weight compression.
PolarQuant’s key differentiation: it applies the Hadamard rotation within blocks (intra-block, block size $d=128$) without modifying the model graph, requires no calibration data for its core algorithm, and is fully composable with any downstream quantizer.
Architecture
PolarQuant is a post-training weight quantization method, not a model architecture. It operates on any pre-trained weight tensor $W \in \mathbb{R}^{m \times n}$ in four stages:
Stage 1: Block Decomposition and Normalization
Flatten $W$ and partition into blocks $\{b_i\}_{i=1}^{N}$ of size $d=128$. Extract the $\ell_2$ norm $r_i = \|b_i\|_2$ and normalize each block to the unit hypersphere:
\[\hat{b}\_i = \frac{b\_i}{r\_i}\]
Stage 2: Hadamard Rotation
Apply the $d \times d$ normalized Walsh–Hadamard matrix $H_d$:
\[\tilde{b}\_i = H\_d \hat{b}\_i\]
The Walsh–Hadamard matrix is defined recursively:
\[H\_1 = [1], \qquad H\_{2d} = \frac{1}{\sqrt{2}} \begin{bmatrix} H\_d & H\_d \\ H\_d & -H\_d \end{bmatrix}\]
This matrix is orthogonal ($H_d H_d^\top = I_d$), symmetric ($H_d^\top = H_d$), and self-inverse ($H_d^{-1} = H_d$), requiring no additional storage for the inverse.
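The recursion can be sketched in a few lines of pure Python. The helper name `hadamard` is illustrative, not from the PolarQuant codebase; it builds the normalized matrix and checks the orthogonality and self-inverse properties claimed above.

```python
import math

def hadamard(d):
    """Normalized Walsh-Hadamard matrix H_d (d must be a power of two)."""
    H = [[1.0]]
    s = 1.0 / math.sqrt(2.0)
    while len(H) < d:
        # H_{2d} = (1/sqrt(2)) [[H_d, H_d], [H_d, -H_d]]
        top = [[s * x for x in row] + [s * x for x in row] for row in H]
        bot = [[s * x for x in row] + [-s * x for x in row] for row in H]
        H = top + bot
    return H

# Orthogonality / self-inverse check: since H is symmetric, H @ H = I
H = hadamard(8)
for i in range(8):
    for j in range(8):
        dot = sum(H[i][k] * H[k][j] for k in range(8))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12
```

Every entry of the normalized $H_8$ has magnitude $1/\sqrt{8}$, so no scale factor needs to be stored alongside the rotation.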
Theoretical justification (Proposition 3.2): After rotation, each coordinate $\tilde{b}_{i,j} = (H_d \hat{b}_i)_j$ satisfies $\sqrt{d} \cdot \tilde{b}_{i,j} \to \mathcal{N}(0, 1)$ as $d \to \infty$ by the central limit theorem for projections of the sphere. For $d = 128$, the Kolmogorov–Smirnov statistic between rotated LLM weight coordinates and $\mathcal{N}(0, 1/d)$ is typically below 0.01.
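The Gaussianization claim can be checked empirically on a deliberately non-Gaussian block. The sketch below (illustrative helper names, not the repository's code) draws a block from a uniform distribution, normalizes, rotates it with a fast Walsh–Hadamard transform, and measures the Kolmogorov–Smirnov distance of $\sqrt{d}\,\tilde{b}_i$ to $\mathcal{N}(0,1)$. Note that a single block gives only 128 samples, so the KS statistic is dominated by sampling noise (~0.1); the 0.01 figure in the text presumably aggregates many blocks.

```python
import math, random

def fwht_normalized(x):
    """In-place fast Walsh-Hadamard transform, normalized so the map is orthogonal."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

def ks_vs_standard_normal(z):
    """Kolmogorov-Smirnov statistic of sample z against N(0, 1)."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    z, n = sorted(z), len(z)
    return max(max(abs((i + 1) / n - Phi(v)), abs(i / n - Phi(v)))
               for i, v in enumerate(z))

random.seed(0)
d = 128
b = [random.uniform(-1.0, 1.0) for _ in range(d)]   # deliberately non-Gaussian block
r = math.sqrt(sum(v * v for v in b))
b_hat = [v / r for v in b]                          # normalize to the unit sphere
b_tilde = fwht_normalized(b_hat)                    # Hadamard rotation
z = [math.sqrt(d) * v for v in b_tilde]             # scale to unit variance
ks = ks_vs_standard_normal(z)                       # small despite uniform input
```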
Stage 3: Scaling and Quantization
Scale to unit variance: $z_i = \sqrt{d} \cdot \tilde{b}_i$, so $z_{i,j} \sim \mathcal{N}(0, 1)$. Quantize each element to the nearest Lloyd–Max centroid:
\[q\_{i,j} = \arg\min\_k \, | z\_{i,j} - c\_k |\]
The Lloyd–Max algorithm [Lloyd 1982, Max 1960] computes the MSE-optimal scalar quantizer for $\mathcal{N}(0, 1)$ with $L = 2^b$ levels. The optimal centroids satisfy:
\[c\_i = \frac{\phi(t\_{i-1}) - \phi(t\_i)}{\Phi(t\_i) - \Phi(t\_{i-1})}, \qquad t\_i = \frac{c\_i + c\_{i+1}}{2}\]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal PDF and CDF, respectively. The quantizer is symmetric ($c_i = -c_{L+1-i}$), halving centroid storage. The fixed-point iteration converges to machine precision within 50 iterations; 100 iterations are used for safety.
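The fixed-point iteration follows directly from the two equations above. The sketch below (`lloyd_max` is an illustrative name, not the repository's implementation) alternates the threshold update and the conditional-mean centroid update.

```python
import math

def lloyd_max(n_levels, iters=100):
    """MSE-optimal scalar quantizer centroids for N(0, 1) via Lloyd-Max iteration."""
    phi = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    # initialize centroids uniformly over [-3, 3]
    c = [-3.0 + 6.0 * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        # thresholds t_i: midpoints between adjacent centroids
        t = [-math.inf] + [(c[i] + c[i + 1]) / 2.0 for i in range(n_levels - 1)] + [math.inf]
        # centroid update: conditional mean of N(0,1) on each cell (t_{i-1}, t_i)
        c = [(phi(t[i]) - phi(t[i + 1])) / (Phi(t[i + 1]) - Phi(t[i]))
             for i in range(n_levels)]
    return c

c = lloyd_max(4)  # b = 2: converges to roughly [-1.5104, -0.4528, +0.4528, +1.5104]
```

For $b = 2$ this reproduces the centroid values $\pm 0.4528, \pm 1.5104$ and MSE 0.1175 listed in the centroid table.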
MSE advantage over absmax (Proposition 3.6): At $b = 3$, the Lloyd–Max quantizer achieves at most 46% of the MSE of absmax — a 54% MSE reduction.
Stage 4: Storage
- Quantized codes: one index per element, materialized as int8 (only 5 bits are used at Q5 and bit-packed for storage)
- Per-block norms: fp16, one per block of 128 elements = 0.125 bits/weight overhead
- Centroid table: $2^b$ fp32 values, shared globally and negligible
Dequantization
The transform chain is exactly invertible: look up centroids from codes, scale by $1/\sqrt{d}$, apply the inverse Hadamard rotation ($H_d^{-1} = H_d$), and multiply by the stored norm $r_i$. Zero runtime overhead at inference.
Complexity
The Walsh–Hadamard transform admits an $O(d \log d)$ fast implementation (analogous to the FFT), making PolarQuant linear in the total number of weights. For $d = 128$, however, computing $H_{128} \hat{b}_i$ with torch.matmul runs about 25x faster than a naive fast Walsh–Hadamard transform implementation, because it leverages optimized cuBLAS GEMM kernels. Full dequantization of a 9B model takes ~4 seconds on an RTX PRO 6000 Blackwell.
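For reference, the $O(d \log d)$ butterfly can be written in a few lines of pure Python (illustrative code; the speed comparison above is against cuBLAS GEMM, not this loop). Normalizing by $1/\sqrt{d}$ makes the transform orthogonal and therefore self-inverse.

```python
import math

def fwht(x):
    """O(d log d) fast Walsh-Hadamard transform with orthogonal normalization."""
    y, n, h = list(x), len(x), 1
    while h < n:
        # butterfly stage: combine pairs at distance h
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)       # normalize so that fwht(fwht(x)) == x
    return [v * s for v in y]

# Self-inverse check: applying the normalized transform twice recovers the input
x = [0.5, -1.25, 3.0, 0.0, 2.0, -0.5, 1.0, -2.0]
xx = fwht(fwht(x))
assert all(abs(a - b) < 1e-12 for a, b in zip(x, xx))
```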
Combined Pipeline: PolarQuant + AWQ
AWQ and PolarQuant operate on orthogonal axes:
- Compute AWQ per-channel scales $s$ from calibration data
- $W' = W \cdot \text{diag}(s)$
- Apply PolarQuant to $W'$
- At dequant: $\hat{W} = \hat{W}' \cdot \text{diag}(s^{-1})$
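The scale/unscale algebra can be sketched with a placeholder round-to-grid quantizer standing in for PolarQuant (all names here are illustrative; the point is only that the scaling commutes cleanly around any quantizer and shrinks the relative error on protected channels).

```python
def quantize_placeholder(W, step=0.25):
    """Stand-in quantizer (NOT PolarQuant): crude uniform rounding to a fixed grid."""
    return [[round(v / step) * step for v in row] for row in W]

def awq_compose(W, s, quantizer):
    """W' = W diag(s) -> quantize -> undo scales: W_hat = Q(W') diag(1/s)."""
    W_scaled = [[v * s[j] for j, v in enumerate(row)] for row in W]
    Q = quantizer(W_scaled)
    return [[v / s[j] for j, v in enumerate(row)] for row in Q]

W = [[0.11, -0.42], [0.38, 0.07]]
s = [4.0, 1.0]                                 # up-scale column 0 to protect it
W_hat = awq_compose(W, s, quantize_placeholder)
W_plain = awq_compose(W, [1.0, 1.0], quantize_placeholder)

# the protected column quantizes more accurately under the AWQ-style scaling
err_scaled = max(abs(W[i][0] - W_hat[i][0]) for i in range(2))
err_plain = max(abs(W[i][0] - W_plain[i][0]) for i in range(2))
```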
Preprocessing for INT4 Inference
PolarQuant Q5 can serve as a preprocessing step for downstream INT4 quantization:
\[W \xrightarrow{\text{PolarQuant Q5}} \hat{W}\_{PQ} \xrightarrow{\text{dequant BF16}} \hat{W}\_{BF16} \xrightarrow{\text{torchao INT4}} \hat{W}\_{INT4}\]
This is not traditional double quantization: PolarQuant acts as a distributional regularizer. The Hadamard rotation homogenizes the weight distribution, producing groups with fewer outliers and a more consistent dynamic range for the downstream absmax INT4 quantizer.
Training
PolarQuant is a post-training quantization method — there is no training phase. The entire quantization process is a deterministic forward pass with no gradient computation, no iterative optimization, and no calibration data (for the core algorithm; AWQ, if used, requires calibration).
Hardware used for experiments:
- Primary: NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB VRAM
- Cross-platform: Apple Mac mini M4 with 16 GB unified memory
Evaluation setup:
- Model: Qwen3.5-9B (~9 billion parameters, hybrid DeltaNet + MoE architecture)
- Benchmark: WikiText-2 perplexity, sliding window of 2048 tokens with stride 512, masking first 1536 context tokens per window
- Speed: Average of 3 runs of 100 generated tokens, after warmup, in tokens/second
- All perplexity numbers are deterministic and reproducible
Dequantization overhead: ~8 seconds added to model load time on RTX PRO 6000 Blackwell (one-time cost). Zero runtime overhead at inference.
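The sliding-window bookkeeping from the evaluation setup can be made concrete. This is a sketch of the window/mask index computation only (illustrative names, not the repository's evaluation script): each 2048-token window advances by stride 512, and its first 1536 positions serve as context only, excluded from the loss.

```python
def ppl_windows(n_tokens, window=2048, stride=512, context_mask=1536):
    """Yield (start, end, n_scored) spans for sliding-window perplexity.

    The first `context_mask` positions of each window are context only and
    do not contribute to the loss; the remaining `window - context_mask`
    tokens are scored exactly once when stride == window - context_mask.
    """
    spans, start = [], 0
    while start + window <= n_tokens:
        spans.append((start, start + window, window - context_mask))
        start += stride
    return spans

spans = ppl_windows(4096)   # 5 windows over a 4096-token corpus, 512 scored each
```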
Evaluation
Main Results on Qwen3.5-9B (RTX PRO 6000 Blackwell)
| Method | tok/s | VRAM | PPL | Δ from FP16 |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | — |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | +0.31 |
| BnB NF4 | 34.6 | 7.7 GB | ~6.7 | +0.33 |
| PolarQuant Q5 + torchao INT4 | 43.1 | 6.5 GB | 6.56 | +0.19 |
| PolarQuant Q5 dequant (FP16) | 45.9 | 18.1 GB | 6.39 | +0.02 |
| PolarQuant + AWQ dequant (FP16) | 45.8 | 17.9 GB | 6.43 | +0.06 |
Key findings:
- PolarQuant Q5 dequant achieves near-lossless compression: PPL 6.39 vs 6.37 FP16 ($\Delta = +0.02$) with no calibration data
- PolarQuant Q5 + torchao INT4 achieves the best perplexity among all INT4 methods (6.56 vs 6.68), reducing the gap to FP16 by 39% while maintaining comparable speed (43.1 vs 43.3 tok/s) and near-identical memory (6.5 vs 6.3 GB)
- PolarQuant Q5 alone outperforms PolarQuant+AWQ (6.39 vs 6.43) in the dequantized FP16 regime, since uniform Q5 preserves more information than mixed-bit allocation
- PolarQuant Q5 dequant runs at full FP16 speed (45.9 tok/s) making it suitable as a high-fidelity compressed storage format
Cross-Platform Results (Apple Mac mini M4, 16 GB)
| Method | tok/s | Memory | PPL |
|---|---|---|---|
| PolarQuant MLX Q4 | 19.7 | 4.8 GB | 6.90 |
A 9B parameter model runs on a 16 GB consumer device at nearly 20 tok/s.
Ablation Study (Qwen3.5-9B, Q5)
| Configuration | PPL | Δ from FP16 | Contribution |
|---|---|---|---|
| FP16 baseline | 6.37 | — | — |
| Absmax Q5 (baseline) | 6.9030 | +0.53 | — |
| + Hadamard rotation only | 6.4010 | +0.03 | 98% |
| + Lloyd–Max centroids only | 6.9139 | +0.54 | -2% |
| + Both (PolarQuant Q5) | 6.3909 | +0.02 | 100% |
| + AWQ scales | 6.43 | +0.06 | — |
| + torchao INT4 on top | 6.56 | +0.19 | — |
Hadamard rotation alone accounts for 98% of the quality improvement at Q5. Lloyd–Max centroids provide only a marginal additional gain ($\Delta = -0.01$ PPL). At $b = 5$ bits (32 levels), the levels are dense enough to approximate the Gaussian density well even with uniformly spaced centroids. Lloyd–Max centroids would contribute more at lower bit widths (e.g., $b = 2$ or $b = 3$), consistent with the 54% MSE reduction at Q3.
Version Evolution
| Version | Technique | PPL | Improvement |
|---|---|---|---|
| v1 | Absmax | 7.26 | baseline |
| v2 | + AWQ | 7.05 | -0.21 |
| v3 | + PolarQuant + AWQ | 6.43 | -0.83 |
| v5 | PolarQuant Q5 + torchao | 6.56 | -0.70 |
The transition from v1 to v3 reduced the perplexity delta from +0.89 to +0.06, a 93% reduction in quantization-induced quality loss.
Storage and Compression
| Format | Bits/weight | Overhead | Total bpw | Compression |
|---|---|---|---|---|
| FP16 | 16.0 | — | 16.0 | 1.0x |
| PolarQuant Q5 | 5.0 | 0.125 | 5.125 | 3.1x |
| PolarQuant Q5 + AWQ | 5.0 | 0.125 + scales | ~5.2 | 3.1x |
| PolarQuant Q5 + torchao INT4 | 4.0 | — | 4.0 | 4.0x |
| PolarQuant Q4 (MLX) | 4.0 | 0.125 | 4.125 | 3.9x |
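The overhead and compression columns follow directly from the block size; a quick arithmetic check:

```python
# One fp16 norm (16 bits) is stored per block of 128 weights
overhead = 16 / 128                  # 0.125 bits/weight

total_q5 = 5.0 + overhead            # 5.125 bpw for PolarQuant Q5
compression_q5 = 16.0 / total_q5     # ~3.12x, reported as 3.1x

total_q4 = 4.0 + overhead            # 4.125 bpw for PolarQuant Q4 (MLX)
compression_q4 = 16.0 / total_q4     # ~3.88x, reported as 3.9x
```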
Lloyd–Max Centroid Values
| Bits | Levels | MSE | Non-negative Centroids |
|---|---|---|---|
| 2 | 4 | 0.1175 | +0.4528, +1.5104 |
| 3 | 8 | 0.03454 | +0.2451, +0.7560, +1.3440, +2.1520 |
| 4 | 16 | 0.009497 | (computed numerically) |
| 5 | 32 | 0.002499 | (computed numerically) |
Reproduction Guide
Installation
git clone https://github.com/caiovicentino/eoq-quantization.git
cd eoq-quantization
pip install -r requirements.txt
Expected dependencies: PyTorch, torchao, transformers (for model loading), and optionally MLX for Apple Silicon support.
Quantization
The core algorithm requires only the model weights:
- Load the target model (e.g., Qwen3.5-9B)
- Flatten weight tensors and partition into blocks of size $d = 128$
- Compute per-block $\ell_2$ norms $r_i = |b_i|_2$
- Normalize: $\hat{b}_i = b_i / r_i$
- Construct the $128 \times 128$ normalized Hadamard matrix $H_{128}$
- Rotate: $\tilde{b}_i = H_{128} \hat{b}_i$
- Scale: $z_i = \sqrt{128} \cdot \tilde{b}_i$
- Quantize to nearest Lloyd–Max centroid: $q_{i,j} = \arg\min_k \mid z_{i,j} - c_k \mid$
- Save codes, norms, and global centroid table
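The steps above can be strung together in a self-contained round-trip sketch (pure Python, illustrative names; it uses the 2-bit centroids from the centroid table for brevity, whereas the actual method uses $b = 5$):

```python
import math, random

# Lloyd-Max centroids for N(0, 1), b = 2 (values from the centroid table)
CENTROIDS = [-1.5104, -0.4528, 0.4528, 1.5104]

def fwht_normalized(x):
    """Orthogonal (self-inverse) fast Walsh-Hadamard transform."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

def quantize_block(b):
    d = len(b)
    r = math.sqrt(sum(v * v for v in b))                 # per-block l2 norm
    z = [math.sqrt(d) * v for v in fwht_normalized([v / r for v in b])]
    codes = [min(range(len(CENTROIDS)), key=lambda k: abs(zj - CENTROIDS[k]))
             for zj in z]                                # nearest-centroid codes
    return codes, r

def dequantize_block(codes, r, d):
    z_hat = [CENTROIDS[k] for k in codes]                # centroid lookup
    return [r * v for v in fwht_normalized([v / math.sqrt(d) for v in z_hat])]

random.seed(1)
d = 128
block = [random.gauss(0.0, 0.02) for _ in range(d)]
codes, r = quantize_block(block)
recon = dequantize_block(codes, r, d)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, recon))) / r
```

At 2 bits the expected relative error is roughly $\sqrt{0.1175} \approx 0.34$, matching the MSE table; at $b = 5$ it drops to about $\sqrt{0.0025} = 0.05$.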
Dequantization and Inference
- Load quantized codes, per-block norms, and centroid table
- For each block: look up centroids from codes, scale by $1/\sqrt{128}$, apply $H_{128}$ (self-inverse), multiply by stored norm
- Reshape to original weight matrix dimensions
- Run inference with standard FP16 or INT4 backend
With AWQ (Optional)
- Compute AWQ per-channel scales from calibration data
- Apply scales to weights before PolarQuant
- Inverse scales after dequantization
With torchao INT4
- Dequantize PolarQuant Q5 weights to BF16
- Re-quantize with torchao INT4 (group size 128)
- Expected result: PPL ~6.56 on Qwen3.5-9B WikiText-2
Verify
- WikiText-2 perplexity: use sliding window of 2048 tokens, stride 512, mask first 1536 context tokens per window
- Target: PPL 6.39 for PolarQuant Q5 dequant, 6.56 for PolarQuant Q5 + torchao INT4
- Pre-trained quantized models available at https://huggingface.co/caiovicentino1
Notes
Key takeaways:
- Rotation is the key insight. The ablation conclusively shows that Hadamard rotation accounts for 98% of the quality improvement; Lloyd–Max centroids are a minor refinement at Q5. This reduces the method to its essential component: a deterministic orthogonal rotation that transforms weight blocks into approximately i.i.d. Gaussian variables.
- Intra-block vs. inter-block rotation. Unlike QuaRot/SpinQuant (which rotate between layers, requiring graph surgery) and QuIP# (which rotates entire weight columns), PolarQuant rotates within each 128-element block independently. This requires no model graph modification and is trivially composable.
- Self-inverse Hadamard is a free lunch. Because the Hadamard matrix is its own inverse ($H_d^{-1} = H_d$), dequantization is equally simple and requires zero additional parameter storage.
- Distributional regularization for cascaded quantization. The finding that PolarQuant Q5 improves downstream torchao INT4 suggests a general principle: the preprocessing quantizer must operate at a sufficiently high bit width to preserve information. Q3 as preprocessing degrades quality (PPL 7.25 vs 6.56) because 8 centroids lose too much signal.
- No calibration needed. Unlike GPTQ, AWQ, and most other post-training quantizers, the core PolarQuant algorithm requires no calibration data. Only the optional AWQ combination uses calibration.
Connections to other work:
- TurboQuant [Ashkboos et al., 2025] is the direct intellectual predecessor, applying the same polar quantization framework to KV cache compression. PolarQuant extends this to weight compression and provides the first ablation quantifying rotation (98%) vs. optimal centroids (2%).
- QuaRot [Ashkboos et al., NeurIPS 2024] and SpinQuant [Liu et al., ICLR 2025] also use Hadamard rotations but apply them between layers (inter-layer). SpinQuant shows learned rotations can outperform fixed Hadamard by up to 16 points on zero-shot tasks — an interesting direction for future PolarQuant improvements.
- QuIP# [Chee et al., ICML 2024] shares the use of Hadamard transforms but targets worst-case error bounds via incoherence processing rather than distributional matching.
- NF4 [Dettmers et al., NeurIPS 2023] also targets Gaussian weight distributions but assumes Gaussianity a priori; PolarQuant explicitly achieves it via rotation.
- The method is specifically evaluated on Qwen3.5-9B, a hybrid DeltaNet + MoE architecture. The Gaussian approximation may be less precise for architectures with very different weight distributions.
Limitations:
- Assumption that Hadamard-rotated blocks are well-approximated by i.i.d. Gaussians may not hold for all architectures
- No exploitation of inter-block correlations
- Not evaluated on lower-bit regimes (Q2, Q3) as a standalone method or on zero-shot task benchmarks
- Single model evaluation (Qwen3.5-9B); broader model coverage would strengthen claims