2026-04-02

PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression

Caio Vicentino

quantization llm-compression post-training

Problem

Large language models at FP16 require ~18 GB for a 9B-parameter model, exceeding consumer GPU memory. Quantization to 4 bits reduces this to ~5–6 GB, but naive methods sacrifice significant quality. The core problem is that widely used absmax quantization assumes a uniform distribution over $[-\alpha, \alpha]$ (where $\alpha = \max_i |w_i|$), which poorly matches the empirically observed near-Gaussian weight distributions of LLMs. This wastes codebook entries on rarely occurring outlier magnitudes and concentrates quantization errors in the high-density central region.

Prior art and their limitations:

  • Absmax quantization [Jacob et al., 2018]: Computationally trivial, but provably suboptimal for non-uniform distributions. Places quantization levels uniformly, wasting resolution in the Gaussian tails.
  • GPTQ [Frantar et al., ICLR 2023]: Layer-wise quantization using approximate Hessian information via the optimal brain surgeon framework. Achieves strong results but requires calibration data and is computationally expensive for large models.
  • AWQ [Lin et al., MLSys 2024]: Activation-aware per-channel scaling to protect important channels. Requires calibration data. Operates on channels (inter-block) rather than within weight blocks.
  • NF4 (NormalFloat) [Dettmers et al., NeurIPS 2023]: Designs codebooks optimal for normally distributed weights by spacing levels uniformly in the quantile domain. Assumes Gaussianity a priori rather than explicitly transforming to it. Information-theoretically optimal for equal-probability bins but does not minimize MSE.
  • QuIP/QuIP# [Chee et al., NeurIPS 2023 / ICML 2024]: Applies random incoherence processing (randomized Hadamard transforms + lattice codebooks) for 2-bit quantization. Operates on entire weight matrix columns (inter-block) and bounds worst-case error. Computationally heavier and less composable.
  • QuaRot [Ashkboos et al., NeurIPS 2024]: Hadamard rotations on hidden states, activations, and KV cache to remove outliers. Requires graph surgery to absorb rotations into adjacent layers (inter-layer), modifying the model graph.
  • SpinQuant [Liu et al., ICLR 2025]: Learned rotation matrices outperforming fixed Hadamard rotations by up to 16 points on zero-shot tasks. Also operates between layers requiring graph modification.
  • TurboQuant [Ashkboos et al., 2025]: Applies polar quantization to KV cache compression during inference, proving information-theoretic lower bounds and achieving near-optimal distortion rates. PolarQuant adapts this framework from KV cache to weight compression.

PolarQuant’s key differentiation: it applies the Hadamard rotation within blocks (intra-block, block size $d=128$) without modifying the model graph, requires no calibration data for its core algorithm, and is fully composable with any downstream quantizer.

Architecture

PolarQuant is a post-training weight quantization method, not a model architecture. It operates on any pre-trained weight tensor $W \in \mathbb{R}^{m \times n}$ in four stages:

Stage 1: Block Decomposition and Normalization

Flatten $W$ and partition into blocks $\{b_i\}_{i=1}^{N}$ of size $d=128$. Extract the $\ell_2$ norm $r_i = \|b_i\|_2$ and normalize each block to the unit hypersphere:

\[\hat{b}\_i = \frac{b\_i}{r\_i}\]
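Stage 1 can be sketched in a few lines of NumPy; the function name `to_blocks` is illustrative, not from the released code, and the tail-padding detail is elided:

```python
import numpy as np

def to_blocks(w, d=128):
    """Flatten a weight tensor into (N, d) blocks, keep per-block l2 norms,
    and normalize each block onto the unit hypersphere."""
    flat = w.reshape(-1)
    assert flat.size % d == 0  # real code would pad the final partial block
    blocks = flat.reshape(-1, d)                           # b_i
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)  # r_i
    return blocks / norms, norms                           # b_hat_i, r_i

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)
b_hat, r = to_blocks(w)  # every row of b_hat now has unit l2 norm
```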

Stage 2: Hadamard Rotation

Apply the $d \times d$ normalized Walsh–Hadamard matrix $H_d$:

\[\tilde{b}\_i = H\_d \hat{b}\_i\]

The Walsh–Hadamard matrix is defined recursively:

\[H\_1 = [1], \qquad H\_{2d} = \frac{1}{\sqrt{2}} \begin{bmatrix} H\_d & H\_d \\ H\_d & -H\_d \end{bmatrix}\]

This matrix is orthogonal ($H_d H_d^\top = I_d$), symmetric ($H_d^\top = H_d$), and self-inverse ($H_d^{-1} = H_d$), requiring no additional storage for the inverse.
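The recursion above translates directly into code. A minimal sketch (the helper name `hadamard` is illustrative; SciPy's `scipy.linalg.hadamard` builds the unnormalized variant):

```python
import numpy as np

def hadamard(d):
    """Normalized Walsh-Hadamard matrix via the Sylvester recursion;
    d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

H = hadamard(128)
# H is orthogonal, symmetric, and self-inverse: H @ H == I
```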

Theoretical justification (Proposition 3.2): After rotation, each coordinate $\tilde{b}_{i,j} = (H_d \hat{b}_i)_j$ satisfies $\sqrt{d} \cdot \tilde{b}_{i,j} \to \mathcal{N}(0, 1)$ as $d \to \infty$ by the central limit theorem for projections of the sphere. For $d = 128$, the Kolmogorov–Smirnov statistic between rotated LLM weight coordinates and $\mathcal{N}(0, 1/d)$ is typically below 0.01.

Stage 3: Scaling and Quantization

Scale to unit variance: $z_i = \sqrt{d} \cdot \tilde{b}_i$, so that approximately $z_{i,j} \sim \mathcal{N}(0, 1)$. Quantize each element to the nearest Lloyd–Max centroid:

\[q\_{i,j} = \arg\min\_k \lvert z\_{i,j} - c\_k \rvert\]

The Lloyd–Max algorithm [Lloyd 1982, Max 1960] computes the MSE-optimal scalar quantizer for $\mathcal{N}(0, 1)$ with $L = 2^b$ levels. The optimal centroids satisfy:

\[c\_i = \frac{\phi(t\_{i-1}) - \phi(t\_i)}{\Phi(t\_i) - \Phi(t\_{i-1})}, \qquad t\_i = \frac{c\_i + c\_{i+1}}{2}\]

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal PDF and CDF, respectively. The quantizer is symmetric ($c_i = -c_{L+1-i}$), which halves centroid storage. The fixed-point iteration converges to machine precision within 50 iterations; 100 iterations are used for safety.
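The fixed-point iteration can be reproduced with nothing but the standard library; this is a sketch (the initialization and iteration count are choices of this example, not necessarily those of the released code):

```python
import math

def lloyd_max_centroids(bits, iters=200):
    """Lloyd-Max fixed point for the MSE-optimal N(0,1) scalar quantizer:
    thresholds are centroid midpoints; centroids are the conditional
    means of the Gaussian between consecutive thresholds."""
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    L = 2 ** bits
    c = [-2.0 + 4.0 * (i + 0.5) / L for i in range(L)]  # crude uniform init
    for _ in range(iters):
        # Thresholds: midpoints between neighboring centroids
        t = [-math.inf] + [(c[i] + c[i + 1]) / 2 for i in range(L - 1)] + [math.inf]
        # Centroids: E[X | t_i < X <= t_{i+1}] for X ~ N(0,1)
        c = [(phi(t[i]) - phi(t[i + 1])) / (Phi(t[i + 1]) - Phi(t[i]))
             for i in range(L)]
    return c
```

At 2 bits this converges to the well-known levels ±0.4528 and ±1.5104 quoted in the centroid table below.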

MSE advantage over absmax (Proposition 3.6): At $b = 3$, the Lloyd–Max quantizer achieves at most 46% of the MSE of absmax — a 54% MSE reduction.

Stage 4: Storage

  • Quantized codes: 5-bit indices per element (materialized as int8, then bit-packed)
  • Per-block norms: fp16, one per block of 128 elements = 0.125 bits/weight overhead
  • Centroid table: $2^b$ fp32 values, shared globally and negligible

Dequantization

Exact inverse: look up centroids from codes, scale by $1/\sqrt{d}$, apply inverse Hadamard rotation ($H_d^{-1} = H_d$), and scale by stored norm $r_i$. Zero runtime overhead.

Complexity

The Walsh–Hadamard transform admits an $O(d \log d)$ fast implementation (analogous to the FFT), making PolarQuant linear in the number of weights. In practice, for $d = 128$, computing $H_{128} \hat{b}_i$ with torch.matmul is 25x faster than a naive fast Walsh–Hadamard transform implementation because it leverages optimized cuBLAS GEMM kernels. Full dequantization of a 9B model takes ~4 seconds on an RTX PRO 6000 Blackwell.
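For reference, the $O(d \log d)$ butterfly itself is a short loop. The sketch below (illustrative, not the released kernel) applies a $1/\sqrt{2}$ scaling per stage so that the result matches the orthogonal $H_d$ built by the Sylvester recursion:

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform in O(d log d), normalized per
    butterfly stage to match the orthogonal H_d."""
    v = np.asarray(v, dtype=np.float64).copy()
    d, h = v.shape[-1], 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = v[..., i:i + h].copy()
            b = v[..., i + h:i + 2 * h].copy()
            v[..., i:i + h] = (a + b) / np.sqrt(2.0)
            v[..., i + h:i + 2 * h] = (a - b) / np.sqrt(2.0)
        h *= 2
    return v

# Dense reference: Sylvester construction of the normalized H_128
H = np.array([[1.0]])
while H.shape[0] < 128:
    H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)

v = np.random.default_rng(0).normal(size=128)
# fwht(v) and H @ v agree to machine precision
```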

Combined Pipeline: PolarQuant + AWQ

AWQ and PolarQuant operate on orthogonal axes:

  1. Compute AWQ per-channel scales $s$ from calibration data
  2. $W' = W \cdot \text{diag}(s)$
  3. Apply PolarQuant to $W’$
  4. At dequant: $\hat{W} = \hat{W}' \cdot \text{diag}(s^{-1})$
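The scale/quantize/unscale composition in the steps above can be sketched as follows. `with_awq_scales` is an illustrative name, and an identity function stands in for PolarQuant's quantize-dequantize round trip:

```python
import numpy as np

def with_awq_scales(W, s, quantize_fn):
    """Compose per-channel scaling with any block quantizer: scale
    columns by s, quantize, then undo the scaling at dequant time.
    quantize_fn stands in for a quantize+dequantize round trip."""
    W_scaled = W * s                 # W' = W diag(s); s broadcasts over columns
    W_hat_scaled = quantize_fn(W_scaled)
    return W_hat_scaled / s          # W_hat = W_hat' diag(1/s)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
s = rng.uniform(0.5, 2.0, size=16)
# With an identity "quantizer" the composition is an exact round trip
W_hat = with_awq_scales(W, s, lambda x: x)
```

Because scaling and quantization act on different axes (channels vs. 128-element blocks), the two methods compose without interfering.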

Preprocessing for INT4 Inference

PolarQuant Q5 can serve as a preprocessing step for downstream INT4 quantization:

\[W \xrightarrow{\text{PolarQuant Q5}} \hat{W}\_{PQ} \xrightarrow{\text{dequant BF16}} \hat{W}\_{BF16} \xrightarrow{\text{torchao INT4}} \hat{W}\_{INT4}\]

This is not traditional double quantization — PolarQuant acts as a distributional regularizer: Hadamard rotation homogenizes the weight distribution, producing groups with fewer outliers and more consistent dynamic range for the downstream absmax INT4 quantizer.

Training

PolarQuant is a post-training quantization method — there is no training phase. The entire quantization process is a deterministic forward pass with no gradient computation, no iterative optimization, and no calibration data (for the core algorithm; AWQ, if used, requires calibration).

Hardware used for experiments:

  • Primary: NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB VRAM
  • Cross-platform: Apple Mac mini M4 with 16 GB unified memory

Evaluation setup:

  • Model: Qwen3.5-9B (~9 billion parameters, hybrid DeltaNet + MoE architecture)
  • Benchmark: WikiText-2 perplexity, sliding window of 2048 tokens with stride 512, masking first 1536 context tokens per window
  • Speed: Average of 3 runs of 100 generated tokens, after warmup, in tokens/second
  • All perplexity numbers are deterministic and reproducible
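The windowing rule above (2048-token windows, stride 512, first 1536 tokens of each window masked) can be captured as pure index logic. `ppl_windows` is a hypothetical helper; the loss accumulation and model call are omitted. Note that window minus mask equals the stride, so every token past the initial warmup is scored exactly once:

```python
def ppl_windows(n_tokens, window=2048, stride=512, n_ctx_mask=1536):
    """Yield (start, end, score_from) spans for strided perplexity:
    each window covers [start, end), and only tokens from score_from
    onward contribute to the loss (earlier ones are context only)."""
    start = 0
    while start + window <= n_tokens:
        yield start, start + window, start + n_ctx_mask
        start += stride

spans = list(ppl_windows(4096))
# 5 windows; scored spans tile [1536, 4096) with no gaps or overlaps
```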

Dequantization overhead: ~8 seconds added to model load time on RTX PRO 6000 Blackwell (one-time cost). Zero runtime overhead at inference.

Evaluation

Main Results on Qwen3.5-9B (RTX PRO 6000 Blackwell)

| Method | tok/s | VRAM | PPL | Δ from FP16 |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | +0.31 |
| BnB NF4 | 34.6 | 7.7 GB | ~6.7 | +0.33 |
| PolarQuant Q5 + torchao INT4 | 43.1 | 6.5 GB | 6.56 | +0.19 |
| PolarQuant Q5 dequant (FP16) | 45.9 | 18.1 GB | 6.39 | +0.02 |
| PolarQuant + AWQ dequant (FP16) | 45.8 | 17.9 GB | 6.43 | +0.06 |

Key findings:

  • PolarQuant Q5 dequant achieves near-lossless compression: PPL 6.39 vs 6.37 FP16 ($\Delta = +0.02$) with no calibration data
  • PolarQuant Q5 + torchao INT4 achieves the best perplexity among all INT4 methods (6.56 vs 6.68), reducing the gap to FP16 by 39% while maintaining comparable speed (43.1 vs 43.3 tok/s) and near-identical memory (6.5 vs 6.3 GB)
  • PolarQuant Q5 alone outperforms PolarQuant+AWQ (6.39 vs 6.43) in the dequantized FP16 regime, since uniform Q5 preserves more information than mixed-bit allocation
  • PolarQuant Q5 dequant runs at full FP16 speed (45.9 tok/s) making it suitable as a high-fidelity compressed storage format

Cross-Platform Results (Apple Mac mini M4, 16 GB)

| Method | tok/s | Memory | PPL |
|---|---|---|---|
| PolarQuant MLX Q4 | 19.7 | 4.8 GB | 6.90 |

A 9B-parameter model runs on a 16 GB consumer device at nearly 20 tok/s.

Ablation Study (Qwen3.5-9B, Q5)

| Configuration | PPL | Δ from FP16 | Contribution |
|---|---|---|---|
| FP16 baseline | 6.37 | | |
| Absmax Q5 (baseline) | 6.9030 | +0.53 | |
| + Hadamard rotation only | 6.4010 | +0.03 | 98% |
| + Lloyd–Max centroids only | 6.9139 | +0.54 | −2% |
| + Both (PolarQuant Q5) | 6.3909 | +0.02 | 100% |
| + AWQ scales | 6.43 | +0.06 | |
| + torchao INT4 on top | 6.56 | +0.19 | |

Hadamard rotation alone accounts for 98% of the quality improvement at Q5. Lloyd–Max centroids provide only a marginal additional gain ($\Delta = -0.01$). At $b = 5$ bits (32 levels), the levels are dense enough to cover the Gaussian density well even when uniformly spaced. Lloyd–Max centroids would contribute more at lower bit widths (e.g., $b = 2$ or $b = 3$), consistent with the 54% MSE reduction at Q3.

Version Evolution

| Version | Technique | PPL | Improvement |
|---|---|---|---|
| v1 | Absmax | 7.26 | baseline |
| v2 | + AWQ | 7.05 | −0.21 |
| v3 | + PolarQuant + AWQ | 6.43 | −0.83 |
| v5 | PolarQuant Q5 + torchao | 6.56 | −0.70 |

The transition from v1 to v3 reduced the perplexity delta from +0.89 to +0.06, a 93% reduction in quantization-induced quality loss.

Storage and Compression

| Format | Bits/weight | Overhead | Total bpw | Compression |
|---|---|---|---|---|
| FP16 | 16.0 | | 16.0 | 1.0x |
| PolarQuant Q5 | 5.0 | 0.125 | 5.125 | 3.1x |
| PolarQuant Q5 + AWQ | 5.0 | 0.125 + scales | ~5.2 | 3.1x |
| PolarQuant Q5 + torchao INT4 | 4.0 | | 4.0 | 4.0x |
| PolarQuant Q4 (MLX) | 4.0 | 0.125 | 4.125 | 3.9x |

Lloyd–Max Centroid Values

| Bits | Levels | MSE | Non-negative centroids |
|---|---|---|---|
| 2 | 4 | 0.1175 | +0.4528, +1.5104 |
| 3 | 8 | 0.03454 | +0.2451, +0.7560, +1.3440, +2.1520 |
| 4 | 16 | 0.009497 | (computed numerically) |
| 5 | 32 | 0.002499 | (computed numerically) |

Reproduction Guide

Installation

git clone https://github.com/caiovicentino/eoq-quantization.git
cd eoq-quantization
pip install -r requirements.txt

Expected dependencies: PyTorch, torchao, transformers (for model loading), and optionally MLX for Apple Silicon support.

Quantization

The core algorithm requires only the model weights:

  1. Load the target model (e.g., Qwen3.5-9B)
  2. Flatten weight tensors and partition into blocks of size $d = 128$
  3. Compute per-block $\ell_2$ norms $r_i = \|b_i\|_2$
  4. Normalize: $\hat{b}_i = b_i / r_i$
  5. Construct the $128 \times 128$ normalized Hadamard matrix $H_{128}$
  6. Rotate: $\tilde{b}_i = H_{128} \hat{b}_i$
  7. Scale: $z_i = \sqrt{128} \cdot \tilde{b}_i$
  8. Quantize to nearest Lloyd–Max centroid: $q_{i,j} = \arg\min_k \lvert z_{i,j} - c_k \rvert$
  9. Save codes, norms, and global centroid table

Dequantization and Inference

  1. Load quantized codes, per-block norms, and centroid table
  2. For each block: look up centroids from codes, scale by $1/\sqrt{128}$, apply $H_{128}$ (self-inverse), multiply by stored norm
  3. Reshape to original weight matrix dimensions
  4. Run inference with standard FP16 or INT4 backend
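The quantization and dequantization recipes above can be combined into one end-to-end round trip. This is a compact sketch with illustrative names; a uniform 32-level codebook stands in for the Lloyd–Max table so the example stays self-contained:

```python
import numpy as np

def hadamard(d):
    """Normalized Walsh-Hadamard matrix via the Sylvester recursion."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

def polarquant_roundtrip(w, centroids, d=128):
    """Quantize-then-dequantize: block, normalize, rotate, scale,
    snap to the nearest centroid, then invert each step."""
    H = hadamard(d)
    blocks = w.reshape(-1, d)
    r = np.linalg.norm(blocks, axis=1, keepdims=True)      # per-block norms
    z = np.sqrt(d) * (blocks / r) @ H.T                    # rotate + scale
    codes = np.abs(z[..., None] - centroids).argmin(-1)    # nearest centroid
    z_hat = centroids[codes]                               # dequant lookup
    # Invert: unscale, rotate back (H is self-inverse), restore norms
    return (((z_hat / np.sqrt(d)) @ H.T) * r).reshape(w.shape)

# Illustrative 5-bit codebook: 32 equally spaced levels (the method
# uses Lloyd-Max centroids instead)
centroids = np.linspace(-3.0, 3.0, 32)
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w_hat = polarquant_roundtrip(w, centroids)
```

Even with this simplified codebook, the reconstruction error is a few percent in relative Frobenius norm for Gaussian-like weights.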

With AWQ (Optional)

  1. Compute AWQ per-channel scales from calibration data
  2. Apply scales to weights before PolarQuant
  3. Inverse scales after dequantization

With torchao INT4

  1. Dequantize PolarQuant Q5 weights to BF16
  2. Re-quantize with torchao INT4 (group size 128)
  3. Expected result: PPL ~6.56 on Qwen3.5-9B WikiText-2

Verify

  • WikiText-2 perplexity: use sliding window of 2048 tokens, stride 512, mask first 1536 context tokens per window
  • Target: PPL 6.39 for PolarQuant Q5 dequant, 6.56 for PolarQuant Q5 + torchao INT4
  • Pre-trained quantized models available at https://huggingface.co/caiovicentino1

Notes

Key takeaways:

  1. Rotation is the key insight. The ablation conclusively shows that Hadamard rotation accounts for 98% of the quality improvement. Lloyd–Max centroids are a minor refinement at Q5. This simplifies the method to its essential component: a deterministic orthogonal rotation that transforms weight blocks into approximately i.i.d. Gaussian variables.

  2. Intra-block vs. inter-block rotation. Unlike QuaRot/SpinQuant (which rotate between layers, requiring graph surgery) and QuIP# (which rotates entire weight columns), PolarQuant rotates within each 128-element block independently. This requires no model graph modification and is trivially composable.

  3. Self-inverse Hadamard is a free lunch. The Hadamard matrix being its own inverse ($H_d^{-1} = H_d$) means dequantization is equally simple with zero additional parameter storage.

  4. Distributional regularization for cascaded quantization. The finding that PolarQuant Q5 improves downstream torchao INT4 reveals a general principle: the preprocessing quantizer must operate at a sufficiently high bit width to preserve information. Q3 as preprocessing degrades quality (PPL 7.25 vs 6.56) because 8 centroids lose too much signal.

  5. No calibration needed. Unlike GPTQ, AWQ, and most other post-training quantizers, the core PolarQuant algorithm requires no calibration data. Only the optional AWQ combination uses calibration.

Connections to other work:

  • TurboQuant [Ashkboos et al., 2025] is the direct intellectual predecessor, applying the same polar quantization framework to KV cache compression. PolarQuant extends this to weight compression and provides the first ablation quantifying rotation (98%) vs. optimal centroids (2%).
  • QuaRot [Ashkboos et al., NeurIPS 2024] and SpinQuant [Liu et al., ICLR 2025] also use Hadamard rotations but apply them between layers (inter-layer). SpinQuant shows learned rotations can outperform fixed Hadamard by up to 16 points on zero-shot tasks — an interesting direction for future PolarQuant improvements.
  • QuIP# [Chee et al., ICML 2024] shares the use of Hadamard transforms but targets worst-case error bounds via incoherence processing rather than distributional matching.
  • NF4 [Dettmers et al., NeurIPS 2023] also targets Gaussian weight distributions but assumes Gaussianity a priori; PolarQuant explicitly achieves it via rotation.
  • The method is specifically evaluated on Qwen3.5-9B, a hybrid DeltaNet + MoE architecture. The Gaussian approximation may be less precise for architectures with very different weight distributions.

Limitations:

  • Assumption that Hadamard-rotated blocks are well-approximated by i.i.d. Gaussians may not hold for all architectures
  • No exploitation of inter-block correlations
  • Not evaluated on lower-bit regimes (Q2, Q3) as a standalone method or on zero-shot task benchmarks
  • Single model evaluation (Qwen3.5-9B); broader model coverage would strengthen claims