2026-04-02
PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Caio Vicentino
Problem
Large language models at FP16 require ~18 GB for a 9B-parameter model, exceeding consumer GPU memory. Quantization to 4 bits reduces this to ~5–6 GB, but naive methods sacrifice significant quality. The core problem is that widely used absmax quantization assumes a uniform distribution over $[-\alpha, \alpha]$ (where $\alpha = \max_i |w_i|$), which poorly matches the empirically observed near-Gaussian weight distributions of LLMs. This wastes codebook entries on rarely occurring outlier magnitudes and concentrates quantization error in the high-density central region.
Prior art and their limitations:
- Absmax quantization [Jacob et al., 2018]: Computationally trivial, but provably suboptimal for non-uniform distributions. Places quantization levels uniformly, wasting resolution in the Gaussian tails.
- GPTQ [Frantar et al., ICLR 2023]: Layer-wise quantization using approximate Hessian information via the optimal brain surgeon framework. Achieves strong results but requires calibration data and is computationally expensive for large models.
- AWQ [Lin et al., MLSys 2024]: Activation-aware per-channel scaling to protect important channels. Requires calibration data. Operates on channels (inter-block) rather than within weight blocks.
- NF4 (NormalFloat) [Dettmers et al., NeurIPS 2023]: Designs codebooks optimal for normally distributed weights by spacing levels uniformly in the quantile domain. Assumes Gaussianity a priori rather than explicitly transforming to it. Information-theoretically optimal for equal-probability bins but does not minimize MSE.
- QuIP/QuIP# [Chee et al., NeurIPS 2023 / ICML 2024]: Applies random incoherence processing (randomized Hadamard transforms + lattice codebooks) for 2-bit quantization. Operates on entire weight matrix columns (inter-block) and bounds worst-case error. Computationally heavier and less composable.
- QuaRot [Ashkboos et al., NeurIPS 2024]: Hadamard rotations on hidden states, activations, and KV cache to remove outliers. Requires graph surgery to absorb rotations into adjacent layers (inter-layer), modifying the model graph.
- SpinQuant [Liu et al., ICLR 2025]: Learned rotation matrices outperforming fixed Hadamard rotations by up to 16 points on zero-shot tasks. Also operates between layers requiring graph modification.
- TurboQuant [Ashkboos et al., 2025]: Applies polar quantization to KV cache compression during inference, proving information-theoretic lower bounds and achieving near-optimal distortion rates. PolarQuant adapts this framework from KV cache to weight compression.
PolarQuant’s key differentiation: it applies the Hadamard rotation within blocks (intra-block, block size $d=128$) without modifying the model graph, requires no calibration data for its core algorithm, and is fully composable with any downstream quantizer.
Architecture
PolarQuant is a post-training weight quantization method, not a model architecture. It operates on any pre-trained weight tensor $W \in \mathbb{R}^{m \times n}$ in four stages:
Stage 1: Block Decomposition and Normalization
Flatten $W$ and partition into blocks $\{b_i\}_{i=1}^{N}$ of size $d=128$. Extract the $\ell_2$ norm $r_i = \|b_i\|_2$ and normalize each block to the unit hypersphere:
\[\hat{b}\_i = \frac{b\_i}{r\_i}\]
Stage 2: Hadamard Rotation
Apply the $d \times d$ normalized Walsh–Hadamard matrix $H_d$:
\[\tilde{b}\_i = H\_d \hat{b}\_i\]
The Walsh–Hadamard matrix is defined recursively:
\[H\_1 = [1], \qquad H\_{2d} = \frac{1}{\sqrt{2}} \begin{bmatrix} H\_d & H\_d \\ H\_d & -H\_d \end{bmatrix}\]
This matrix is orthogonal ($H_d H_d^\top = I_d$), symmetric ($H_d^\top = H_d$), and self-inverse ($H_d^{-1} = H_d$), requiring no additional storage for the inverse.
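The recursion can be sketched in a few lines of pure Python. The helper name `hadamard` is illustrative, not from the PolarQuant codebase; it builds the normalized matrix and checks the orthogonality and self-inverse properties claimed above.

```python
import math

def hadamard(d):
    """Normalized Walsh-Hadamard matrix H_d (d must be a power of two)."""
    H = [[1.0]]
    s = 1.0 / math.sqrt(2.0)
    while len(H) < d:
        # H_{2d} = (1/sqrt(2)) [[H_d, H_d], [H_d, -H_d]]
        top = [[s * x for x in row] + [s * x for x in row] for row in H]
        bot = [[s * x for x in row] + [-s * x for x in row] for row in H]
        H = top + bot
    return H

# Orthogonality / self-inverse check: since H is symmetric, H @ H = I
H = hadamard(8)
for i in range(8):
    for j in range(8):
        dot = sum(H[i][k] * H[k][j] for k in range(8))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12
```

Every entry of the normalized $H_8$ has magnitude $1/\sqrt{8}$, so no scale factor needs to be stored alongside the rotation.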
Theoretical justification (Proposition 3.2): After rotation, each coordinate $\tilde{b}_{i,j} = (H_d \hat{b}_i)_j$ satisfies $\sqrt{d} \cdot \tilde{b}_{i,j} \to \mathcal{N}(0, 1)$ as $d \to \infty$ by the central limit theorem for projections of the sphere. For $d = 128$, the Kolmogorov–Smirnov statistic between rotated LLM weight coordinates and $\mathcal{N}(0, 1/d)$ is typically below 0.01.
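The Gaussianization claim can be checked empirically on a deliberately non-Gaussian block. The sketch below (illustrative helper names, not the repository's code) draws a block from a uniform distribution, normalizes, rotates it with a fast Walsh–Hadamard transform, and measures the Kolmogorov–Smirnov distance of $\sqrt{d}\,\tilde{b}_i$ to $\mathcal{N}(0,1)$. Note that a single block gives only 128 samples, so the KS statistic is dominated by sampling noise (~0.1); the 0.01 figure in the text presumably aggregates many blocks.

```python
import math, random

def fwht_normalized(x):
    """In-place fast Walsh-Hadamard transform, normalized so the map is orthogonal."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

def ks_vs_standard_normal(z):
    """Kolmogorov-Smirnov statistic of sample z against N(0, 1)."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    z, n = sorted(z), len(z)
    return max(max(abs((i + 1) / n - Phi(v)), abs(i / n - Phi(v)))
               for i, v in enumerate(z))

random.seed(0)
d = 128
b = [random.uniform(-1.0, 1.0) for _ in range(d)]   # deliberately non-Gaussian block
r = math.sqrt(sum(v * v for v in b))
b_hat = [v / r for v in b]                          # normalize to the unit sphere
b_tilde = fwht_normalized(b_hat)                    # Hadamard rotation
z = [math.sqrt(d) * v for v in b_tilde]             # scale to unit variance
ks = ks_vs_standard_normal(z)                       # small despite uniform input
```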
Stage 3: Scaling and Quantization
Scale to unit variance: $z_i = \sqrt{d} \cdot \tilde{b}_i$, so $z_{i,j} \sim \mathcal{N}(0, 1)$. Quantize each element to the nearest Lloyd–Max centroid:
\[q\_{i,j} = \arg\min\_k \, | z\_{i,j} - c\_k |\]
The Lloyd–Max algorithm [Lloyd 1982, Max 1960] computes the MSE-optimal scalar quantizer for $\mathcal{N}(0, 1)$ with $L = 2^b$ levels. The optimal centroids satisfy:
\[c\_i = \frac{\phi(t\_{i-1}) - \phi(t\_i)}{\Phi(t\_i) - \Phi(t\_{i-1})}, \qquad t\_i = \frac{c\_i + c\_{i+1}}{2}\]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal PDF and CDF, respectively. The quantizer is symmetric ($c_i = -c_{L+1-i}$), halving centroid storage. The fixed-point iteration converges to machine precision within 50 iterations; 100 iterations are used for safety.
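The fixed-point iteration follows directly from the two equations above. The sketch below (`lloyd_max` is an illustrative name, not the repository's implementation) alternates the threshold update and the conditional-mean centroid update.

```python
import math

def lloyd_max(n_levels, iters=100):
    """MSE-optimal scalar quantizer centroids for N(0, 1) via Lloyd-Max iteration."""
    phi = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    # initialize centroids uniformly over [-3, 3]
    c = [-3.0 + 6.0 * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        # thresholds t_i: midpoints between adjacent centroids
        t = [-math.inf] + [(c[i] + c[i + 1]) / 2.0 for i in range(n_levels - 1)] + [math.inf]
        # centroid update: conditional mean of N(0,1) on each cell (t_{i-1}, t_i)
        c = [(phi(t[i]) - phi(t[i + 1])) / (Phi(t[i + 1]) - Phi(t[i]))
             for i in range(n_levels)]
    return c

c = lloyd_max(4)  # b = 2: converges to roughly [-1.5104, -0.4528, +0.4528, +1.5104]
```

For $b = 2$ this reproduces the centroid values $\pm 0.4528, \pm 1.5104$ and MSE 0.1175 listed in the centroid table.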
MSE advantage over absmax (Proposition 3.6): At $b = 3$, the Lloyd–Max quantizer achieves at most 46% of the MSE of absmax — a 54% MSE reduction.
Stage 4: Storage
- Quantized codes: one index per element, materialized as int8 (only 5 bits are used at Q5 and bit-packed for storage)
- Per-block norms: fp16, one per block of 128 elements = 0.125 bits/weight overhead
- Centroid table: $2^b$ fp32 values, shared globally and negligible
Dequantization
The transform chain is exactly invertible: look up centroids from codes, scale by $1/\sqrt{d}$, apply the inverse Hadamard rotation ($H_d^{-1} = H_d$), and multiply by the stored norm $r_i$. Zero runtime overhead at inference.
Complexity
The Walsh–Hadamard transform admits an $O(d \log d)$ fast implementation (analogous to the FFT), making PolarQuant linear in the total number of weights. For $d = 128$, however, computing $H_{128} \hat{b}_i$ with torch.matmul runs about 25x faster than a naive fast Walsh–Hadamard transform implementation, because it leverages optimized cuBLAS GEMM kernels. Full dequantization of a 9B model takes ~4 seconds on an RTX PRO 6000 Blackwell.
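For reference, the $O(d \log d)$ butterfly can be written in a few lines of pure Python (illustrative code; the speed comparison above is against cuBLAS GEMM, not this loop). Normalizing by $1/\sqrt{d}$ makes the transform orthogonal and therefore self-inverse.

```python
import math

def fwht(x):
    """O(d log d) fast Walsh-Hadamard transform with orthogonal normalization."""
    y, n, h = list(x), len(x), 1
    while h < n:
        # butterfly stage: combine pairs at distance h
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)       # normalize so that fwht(fwht(x)) == x
    return [v * s for v in y]

# Self-inverse check: applying the normalized transform twice recovers the input
x = [0.5, -1.25, 3.0, 0.0, 2.0, -0.5, 1.0, -2.0]
xx = fwht(fwht(x))
assert all(abs(a - b) < 1e-12 for a, b in zip(x, xx))
```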
Combined Pipeline: PolarQuant + AWQ
AWQ and PolarQuant operate on orthogonal axes:
- Compute AWQ per-channel scales $s$ from calibration data
- $W' = W \cdot \text{diag}(s)$
- Apply PolarQuant to $W'$
- At dequant: $\hat{W} = \hat{W}' \cdot \text{diag}(s^{-1})$
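The scale/unscale algebra can be sketched with a placeholder round-to-grid quantizer standing in for PolarQuant (all names here are illustrative; the point is only that the scaling commutes cleanly around any quantizer and shrinks the relative error on protected channels).

```python
def quantize_placeholder(W, step=0.25):
    """Stand-in quantizer (NOT PolarQuant): crude uniform rounding to a fixed grid."""
    return [[round(v / step) * step for v in row] for row in W]

def awq_compose(W, s, quantizer):
    """W' = W diag(s) -> quantize -> undo scales: W_hat = Q(W') diag(1/s)."""
    W_scaled = [[v * s[j] for j, v in enumerate(row)] for row in W]
    Q = quantizer(W_scaled)
    return [[v / s[j] for j, v in enumerate(row)] for row in Q]

W = [[0.11, -0.42], [0.38, 0.07]]
s = [4.0, 1.0]                                 # up-scale column 0 to protect it
W_hat = awq_compose(W, s, quantize_placeholder)
W_plain = awq_compose(W, [1.0, 1.0], quantize_placeholder)

# the protected column quantizes more accurately under the AWQ-style scaling
err_scaled = max(abs(W[i][0] - W_hat[i][0]) for i in range(2))
err_plain = max(abs(W[i][0] - W_plain[i][0]) for i in range(2))
```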
Preprocessing for INT4 Inference
PolarQuant Q5 can serve as a preprocessing step for downstream INT4 quantization:
\[W \xrightarrow{\text{PolarQuant Q5}} \hat{W}\_{PQ} \xrightarrow{\text{dequant BF16}} \hat{W}\_{BF16} \xrightarrow{\text{torchao INT4}} \hat{W}\_{INT4}\]
This is not traditional double quantization: PolarQuant acts as a distributional regularizer. The Hadamard rotation homogenizes the weight distribution, producing groups with fewer outliers and a more consistent dynamic range for the downstream absmax INT4 quantizer.
Training
PolarQuant is a post-training quantization method — there is no training phase. The entire quantization process is a deterministic forward pass with no gradient computation, no iterative optimization, and no calibration data (for the core algorithm; AWQ, if used, requires calibration).
Hardware used for experiments:
- Primary: NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB VRAM
- Cross-platform: Apple Mac mini M4 with 16 GB unified memory
Evaluation setup:
- Model: Qwen3.5-9B (~9 billion parameters, hybrid DeltaNet + MoE architecture)
- Benchmark: WikiText-2 perplexity, sliding window of 2048 tokens with stride 512, masking first 1536 context tokens per window
- Speed: Average of 3 runs of 100 generated tokens, after warmup, in tokens/second
- All perplexity numbers are deterministic and reproducible
Dequantization overhead: ~8 seconds added to model load time on RTX PRO 6000 Blackwell (one-time cost). Zero runtime overhead at inference.
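The sliding-window bookkeeping from the evaluation setup can be made concrete. This is a sketch of the window/mask index computation only (illustrative names, not the repository's evaluation script): each 2048-token window advances by stride 512, and its first 1536 positions serve as context only, excluded from the loss.

```python
def ppl_windows(n_tokens, window=2048, stride=512, context_mask=1536):
    """Yield (start, end, n_scored) spans for sliding-window perplexity.

    The first `context_mask` positions of each window are context only and
    do not contribute to the loss; the remaining `window - context_mask`
    tokens are scored exactly once when stride == window - context_mask.
    """
    spans, start = [], 0
    while start + window <= n_tokens:
        spans.append((start, start + window, window - context_mask))
        start += stride
    return spans

spans = ppl_windows(4096)   # 5 windows over a 4096-token corpus, 512 scored each
```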
Evaluation
Main Results on Qwen3.5-9B (RTX PRO 6000 Blackwell)
| Method | tok/s | VRAM | PPL | Δ from FP16 |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | — |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | +0.31 |
| BnB NF4 | 34.6 | 7.7 GB | ~6.7 | +0.33 |
| PolarQuant Q5 + torchao INT4 | 43.1 | 6.5 GB | 6.56 | +0.19 |
| PolarQuant Q5 dequant (FP16) | 45.9 | 18.1 GB | 6.39 | +0.02 |
| PolarQuant + AWQ dequant (FP16) | 45.8 | 17.9 GB | 6.43 | +0.06 |
Key findings:
- PolarQuant Q5 dequant achieves near-lossless compression: PPL 6.39 vs 6.37 FP16 ($\Delta = +0.02$) with no calibration data
- PolarQuant Q5 + torchao INT4 achieves the best perplexity among all INT4 methods (6.56 vs 6.68), reducing the gap to FP16 by 39% while maintaining comparable speed (43.1 vs 43.3 tok/s) and near-identical memory (6.5 vs 6.3 GB)
- PolarQuant Q5 alone outperforms PolarQuant+AWQ (6.39 vs 6.43) in the dequantized FP16 regime, since uniform Q5 preserves more information than mixed-bit allocation
- PolarQuant Q5 dequant runs at full FP16 speed (45.9 tok/s) making it suitable as a high-fidelity compressed storage format
Cross-Platform Results (Apple Mac mini M4, 16 GB)
| Method | tok/s | Memory | PPL |
|---|---|---|---|
| PolarQuant MLX Q4 | 19.7 | 4.8 GB | 6.90 |
A 9B parameter model runs on a 16 GB consumer device at nearly 20 tok/s.
Ablation Study (Qwen3.5-9B, Q5)
| Configuration | PPL | Δ from FP16 | Contribution |
|---|---|---|---|
| FP16 baseline | 6.37 | — | — |
| Absmax Q5 (baseline) | 6.9030 | +0.53 | — |
| + Hadamard rotation only | 6.4010 | +0.03 | 98% |
| + Lloyd–Max centroids only | 6.9139 | +0.54 | -2% |
| + Both (PolarQuant Q5) | 6.3909 | +0.02 | 100% |
| + AWQ scales | 6.43 | +0.06 | — |
| + torchao INT4 on top | 6.56 | +0.19 | — |
Hadamard rotation alone accounts for 98% of the quality improvement at Q5. Lloyd–Max centroids provide only a marginal additional gain ($\Delta = -0.01$ PPL). At $b = 5$ bits (32 levels), the levels are dense enough to approximate the Gaussian density well even with uniformly spaced centroids. Lloyd–Max centroids would contribute more at lower bit widths (e.g., $b = 2$ or $b = 3$), consistent with the 54% MSE reduction at Q3.
Version Evolution
| Version | Technique | PPL | Improvement |
|---|---|---|---|
| v1 | Absmax | 7.26 | baseline |
| v2 | + AWQ | 7.05 | -0.21 |
| v3 | + PolarQuant + AWQ | 6.43 | -0.83 |
| v5 | PolarQuant Q5 + torchao | 6.56 | -0.70 |
The transition from v1 to v3 reduced the perplexity delta from +0.89 to +0.06, a 93% reduction in quantization-induced quality loss.
Storage and Compression
| Format | Bits/weight | Overhead | Total bpw | Compression |
|---|---|---|---|---|
| FP16 | 16.0 | — | 16.0 | 1.0x |
| PolarQuant Q5 | 5.0 | 0.125 | 5.125 | 3.1x |
| PolarQuant Q5 + AWQ | 5.0 | 0.125 + scales | ~5.2 | 3.1x |
| PolarQuant Q5 + torchao INT4 | 4.0 | — | 4.0 | 4.0x |
| PolarQuant Q4 (MLX) | 4.0 | 0.125 | 4.125 | 3.9x |
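The overhead and compression columns follow directly from the block size; a quick arithmetic check:

```python
# One fp16 norm (16 bits) is stored per block of 128 weights
overhead = 16 / 128                  # 0.125 bits/weight

total_q5 = 5.0 + overhead            # 5.125 bpw for PolarQuant Q5
compression_q5 = 16.0 / total_q5     # ~3.12x, reported as 3.1x

total_q4 = 4.0 + overhead            # 4.125 bpw for PolarQuant Q4 (MLX)
compression_q4 = 16.0 / total_q4     # ~3.88x, reported as 3.9x
```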
Lloyd–Max Centroid Values
| Bits | Levels | MSE | Non-negative Centroids |
|---|---|---|---|
| 2 | 4 | 0.1175 | +0.4528, +1.5104 |
| 3 | 8 | 0.03454 | +0.2451, +0.7560, +1.3440, +2.1520 |
| 4 | 16 | 0.009497 | (computed numerically) |
| 5 | 32 | 0.002499 | (computed numerically) |
Reproduction Guide
Installation
git clone https://github.com/caiovicentino/eoq-quantization.git
cd eoq-quantization
pip install -r requirements.txt
Expected dependencies: PyTorch, torchao, transformers (for model loading), and optionally MLX for Apple Silicon support.
Quantization
The core algorithm requires only the model weights:
- Load the target model (e.g., Qwen3.5-9B)
- Flatten weight tensors and partition into blocks of size $d = 128$
- Compute per-block $\ell_2$ norms $r_i = |b_i|_2$
- Normalize: $\hat{b}_i = b_i / r_i$
- Construct the $128 \times 128$ normalized Hadamard matrix $H_{128}$
- Rotate: $\tilde{b}_i = H_{128} \hat{b}_i$
- Scale: $z_i = \sqrt{128} \cdot \tilde{b}_i$
- Quantize to nearest Lloyd–Max centroid: $q_{i,j} = \arg\min_k \mid z_{i,j} - c_k \mid$
- Save codes, norms, and global centroid table
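The steps above can be strung together in a self-contained round-trip sketch (pure Python, illustrative names; it uses the 2-bit centroids from the centroid table for brevity, whereas the actual method uses $b = 5$):

```python
import math, random

# Lloyd-Max centroids for N(0, 1), b = 2 (values from the centroid table)
CENTROIDS = [-1.5104, -0.4528, 0.4528, 1.5104]

def fwht_normalized(x):
    """Orthogonal (self-inverse) fast Walsh-Hadamard transform."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

def quantize_block(b):
    d = len(b)
    r = math.sqrt(sum(v * v for v in b))                 # per-block l2 norm
    z = [math.sqrt(d) * v for v in fwht_normalized([v / r for v in b])]
    codes = [min(range(len(CENTROIDS)), key=lambda k: abs(zj - CENTROIDS[k]))
             for zj in z]                                # nearest-centroid codes
    return codes, r

def dequantize_block(codes, r, d):
    z_hat = [CENTROIDS[k] for k in codes]                # centroid lookup
    return [r * v for v in fwht_normalized([v / math.sqrt(d) for v in z_hat])]

random.seed(1)
d = 128
block = [random.gauss(0.0, 0.02) for _ in range(d)]
codes, r = quantize_block(block)
recon = dequantize_block(codes, r, d)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, recon))) / r
```

At 2 bits the expected relative error is roughly $\sqrt{0.1175} \approx 0.34$, matching the MSE table; at $b = 5$ it drops to about $\sqrt{0.0025} = 0.05$.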
Dequantization and Inference
- Load quantized codes, per-block norms, and centroid table
- For each block: look up centroids from codes, scale by $1/\sqrt{128}$, apply $H_{128}$ (self-inverse), multiply by stored norm
- Reshape to original weight matrix dimensions
- Run inference with standard FP16 or INT4 backend
With AWQ (Optional)
- Compute AWQ per-channel scales from calibration data
- Apply scales to weights before PolarQuant
- Inverse scales after dequantization
With torchao INT4
- Dequantize PolarQuant Q5 weights to BF16
- Re-quantize with torchao INT4 (group size 128)
- Expected result: PPL ~6.56 on Qwen3.5-9B WikiText-2
Verify
- WikiText-2 perplexity: use sliding window of 2048 tokens, stride 512, mask first 1536 context tokens per window
- Target: PPL 6.39 for PolarQuant Q5 dequant, 6.56 for PolarQuant Q5 + torchao INT4
- Pre-trained quantized models available at https://huggingface.co/caiovicentino1
Notes
Key takeaways:
- Rotation is the key insight. The ablation conclusively shows that Hadamard rotation accounts for 98% of the quality improvement; Lloyd–Max centroids are a minor refinement at Q5. This reduces the method to its essential component: a deterministic orthogonal rotation that transforms weight blocks into approximately i.i.d. Gaussian variables.
- Intra-block vs. inter-block rotation. Unlike QuaRot/SpinQuant (which rotate between layers, requiring graph surgery) and QuIP# (which rotates entire weight columns), PolarQuant rotates within each 128-element block independently. This requires no model graph modification and is trivially composable.
- Self-inverse Hadamard is a free lunch. Because the Hadamard matrix is its own inverse ($H_d^{-1} = H_d$), dequantization is equally simple and requires zero additional parameter storage.
- Distributional regularization for cascaded quantization. The finding that PolarQuant Q5 improves downstream torchao INT4 suggests a general principle: the preprocessing quantizer must operate at a sufficiently high bit width to preserve information. Q3 as preprocessing degrades quality (PPL 7.25 vs 6.56) because 8 centroids lose too much signal.
- No calibration needed. Unlike GPTQ, AWQ, and most other post-training quantizers, the core PolarQuant algorithm requires no calibration data. Only the optional AWQ combination uses calibration.
Connections to other work:
- TurboQuant [Ashkboos et al., 2025] is the direct intellectual predecessor, applying the same polar quantization framework to KV cache compression. PolarQuant extends this to weight compression and provides the first ablation quantifying rotation (98%) vs. optimal centroids (2%).
- QuaRot [Ashkboos et al., NeurIPS 2024] and SpinQuant [Liu et al., ICLR 2025] also use Hadamard rotations but apply them between layers (inter-layer). SpinQuant shows learned rotations can outperform fixed Hadamard by up to 16 points on zero-shot tasks — an interesting direction for future PolarQuant improvements.
- QuIP# [Chee et al., ICML 2024] shares the use of Hadamard transforms but targets worst-case error bounds via incoherence processing rather than distributional matching.
- NF4 [Dettmers et al., NeurIPS 2023] also targets Gaussian weight distributions but assumes Gaussianity a priori; PolarQuant explicitly achieves it via rotation.
- The method is specifically evaluated on Qwen3.5-9B, a hybrid DeltaNet + MoE architecture. The Gaussian approximation may be less precise for architectures with very different weight distributions.
Limitations:
- Assumption that Hadamard-rotated blocks are well-approximated by i.i.d. Gaussians may not hold for all architectures
- No exploitation of inter-block correlations
- Not evaluated on lower-bit regimes (Q2, Q3) as a standalone method or on zero-shot task benchmarks
- Single model evaluation (Qwen3.5-9B); broader model coverage would strengthen claims