2026-04-02

ITQ3_S: Interleaved Ternary Quantization with TurboQuant — High-Fidelity 3-bit LLM Inference via Rotation-Domain Adaptive Quantization

Edward J. Yoon

quantization LLM-compression edge-inference

Problem

The paper targets the fundamental challenge of running large language models (70B+ parameters) on consumer-grade GPUs using 3-bit weight quantization — the regime often called the “breaking point” for LLM fidelity. A 70B-parameter model in FP16 requires ~140 GiB of memory; even the NVIDIA RTX 5090 with 32 GiB of VRAM cannot load it without aggressive quantization.

Two root causes of 3-bit quality collapse:

  1. Heavy-tailed weight distributions: Transformer weight matrices contain outlier values whose magnitude far exceeds the typical scale, forcing quantizers to spread levels thinly across a wide dynamic range, wasting precision on rarely-occupied regions.
  2. Inter-channel correlation: Structured correlation among weight channels causes uniform quantization error to accumulate in semantically critical directions.

Prior methods and their limitations:

  • GPTQ (Frantar et al., ICLR 2023): Uses second-order Hessian correction for weight rounding, but is not purpose-built for 3-bit on consumer GPUs with fused CUDA kernel constraints.
  • AWQ (Lin et al., 2023): Applies activation-aware per-channel scaling; effective at 4-bit but lacks the rotation needed to handle outliers at 3-bit.
  • SqueezeLLM (Kim et al., ICML 2024): Uses sparse outlier coding (Fisher information pruning) to isolate outliers; introduces complexity in the inference path.
  • QuIP# (Tseng et al., ICML 2024): Applies random orthogonal rotations (Kronecker products of Hadamard matrices) to “incoherify” weights before quantization. Random rotations require storing a random seed and reconstructing the rotation at inference time, adding latency. Naively composing QuIP# rotation with existing weight quantizers — applying rotation only to KV cache while leaving weights in the original domain — introduces a systematic domain mismatch whose errors accumulate across transformer layers.
  • LLM.int8() (Dettmers et al., NeurIPS 2022): Splits computation into FP16 (for outlier channels) and INT8 (for normal channels); requires masked scatter-gather operations which are expensive on consumer GPUs.
  • SpQR (Dettmers et al., ICLR 2024): Stores a small number of outlier weights in higher precision alongside a low-bit compressed tensor; the mixed-precision storage complicates kernel design.
  • IQ3_S (llama.cpp baseline): Existing 3-bit ternary format without rotation; suffers 0.89 perplexity gap vs FP16 on LLaMA-3 8B.

The key gap: TurboQuant (TQ) establishes the theoretical foundation for FWHT-based rotation but lacks a native CUDA kernel implementation, precluding direct deployment. No existing method codesigns the rotation and ternary quantization as a single unified pipeline with fused CUDA kernel support.

Architecture

Overview

ITQ3_S (Interleaved Ternary Quantization – Specialized) combines a deterministic Fast Walsh-Hadamard Transform (FWHT) pre-rotation with ternary quantization into a unified 3-bit weight format, with the inverse transform fused into the CUDA MMQ kernel. The core pipeline is:

\[\text{Encode}(w) = \left(\text{Pack}_{3b}\!\left(\text{Clamp}\!\left(\text{round}\!\left(\frac{Hw}{d_k}\right) + z_k,\ -1,\ 1\right)\right),\ d_k,\ z_k\right)\]

where $H = H_{256}$ is the normalized 256-point Walsh-Hadamard matrix.

Walsh-Hadamard Transform

The WHT of a vector $v \in \mathbb{R}^n$ (where $n = 2^k$) is:

\[\hat{v} = H_n v, \qquad H_n = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{n/2} & H_{n/2} \\ H_{n/2} & -H_{n/2} \end{pmatrix}, \qquad H_1 = (1)\]

The WHT is self-inverse: $H_n^{-1} = H_n$, since $H_n H_n = I$ for the normalized form. The Fast WHT runs in $O(n \log n)$ via butterfly operations:

\[(u, v) \mapsto (u + v, \; u - v)\]

applied across $\log_2 n$ stages.
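The recursion and butterfly above can be sketched in NumPy (a reference sketch; the function name `fwht` is ours, not the paper's):

```python
import numpy as np

def fwht(v: np.ndarray) -> np.ndarray:
    """Normalized Fast Walsh-Hadamard Transform; len(v) must be a power of two."""
    out = v.astype(np.float64).copy()
    n = out.size
    step = 1
    while step < n:                      # log2(n) stages
        for base in range(0, n, 2 * step):
            for j in range(base, base + step):
                u, w = out[j], out[j + step]
                out[j], out[j + step] = u + w, u - w   # butterfly (u, v) -> (u+v, u-v)
        step *= 2
    return out / np.sqrt(n)              # normalization makes H self-inverse

x = np.random.default_rng(0).normal(size=256)
assert np.allclose(fwht(fwht(x)), x)                            # H^2 = I
assert np.isclose(np.linalg.norm(fwht(x)), np.linalg.norm(x))   # isometry
```

Because the normalized transform is self-inverse, the same routine serves for both the encode-side FWHT and the decode-side IFWHT.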

Theoretical Foundation

Theorem 1 (Distribution Smoothing): Let $w \in \mathbb{R}^n$ be a weight vector whose entries are independent, mean-zero, with variance $\sigma^2$ and bounded $\ell_4$ norm, and let $w' = H_n w$ be its Walsh-Hadamard transform. Then by the Central Limit Theorem for Walsh transforms, the entries of $w'$ converge in distribution to $\mathcal{N}(0, \sigma^2)$ as $n \to \infty$.

Corollary 1 (Outlier Suppression): For $n = 256$, the expected peak magnitude after the transform is approximately $\sigma \sqrt{2 \ln 256} \approx 3.3\,\sigma$, so the expected $\ell_\infty$ reduction factor is approximately $\|w\|_\infty / (3.3\,\sigma)$. A single large outlier $w_j = M \gg \sigma$ contributes only $M / \sqrt{n}$ to each transformed coefficient, distributing its energy uniformly.
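A quick numeric check of the corollary (illustrative code, ours): a lone spike of magnitude $M = 100$ in a 256-element block contributes exactly $M/\sqrt{256} = 6.25$ to every transformed coefficient:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Walsh-Hadamard matrix via the Sylvester construction."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

w = np.zeros(256)
w[7] = 100.0                      # single outlier, M = 100
wt = hadamard(256) @ w
print(np.abs(wt).max())           # 6.25 == 100 / 16, in every coefficient
```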

Theorem 2 (Reconstruction Bound): For any weight vector $w \in \mathbb{R}^{256}$, the total squared reconstruction error is bounded by:

\[\|\hat{w} - w\|_2^2 \leq \frac{d_k^2}{4} \cdot n + \epsilon_{\text{FWHT}}\]

where $d_k$ is the ternary quantization scale per block and $\epsilon_{\text{FWHT}}$ is the floating-point rounding error of the 256-point IFWHT (at most $O(n \cdot \log n \cdot u)$ for machine epsilon $u$). The isometric property of $H$ ensures the FWHT rotation does not increase the quantization error norm; its benefit lies entirely in reducing $d_k$.

Optimal Ternary Scale

For Gaussian-distributed input $x \sim \mathcal{N}(0, \sigma^2)$, the MSE-optimal ternary threshold is:

\[\alpha^* = \sigma \sqrt{\frac{2}{\pi}} \approx 0.798\,\sigma\]

After the FWHT, entries of $Hw$ are approximately $\mathcal{N}(0, \sigma^2)$, so $\alpha^*$ is computed directly from the empirical standard deviation — no expensive second-order Hessian computation needed.
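A minimal sketch of the scale rule on synthetic Gaussian data (variable names ours; this is not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.05, size=256)    # stand-in for a rotated weight block

d = 0.798 * w.std()                    # alpha* from the empirical sigma alone
q = np.clip(np.round(w / d), -1, 1)    # ternary codes in {-1, 0, +1}
w_hat = d * q                          # dequantize

mse = np.mean((w_hat - w) ** 2)
assert mse < w.var()                   # far better than the zero predictor
```

The point is that $d_k$ comes from a single pass over the block — no calibration set, no Hessian.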

Block Structure and Memory Layout

Weights are organized into blocks of $n = 256$ elements, aligned to the FWHT transform unit:

| Component | Size |
|---|---|
| Quants (256 × 3 bits) | 96 bytes |
| Scale ($d_k$, FP16) | 2 bytes |
| Zero-point ($z_k$, FP16) | 2 bytes (optional) |
| Sub-block scales (8 × FP16) | 16 bytes (optional) |

Total per 256 weights: 100 bytes ⇒ 3.125 bits/weight. With sub-block scales: 116 bytes ⇒ 3.625 bits/weight.
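The bits-per-weight arithmetic checks out directly:

```python
# Byte budget for one 256-element ITQ3_S block.
quant_bytes = 256 * 3 // 8           # 96 bytes of packed 3-bit quants
block_bytes = quant_bytes + 2 + 2    # + FP16 scale + FP16 zero-point = 100 bytes
block_sub = block_bytes + 8 * 2      # + 8 FP16 sub-block scales = 116 bytes

print(block_bytes * 8 / 256)         # 3.125 bits/weight
print(block_sub * 8 / 256)           # 3.625 bits/weight
```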

Interleaved Packing

Each ternary value $q \in \{0, 1, 2\}$ (representing $\{-1, 0, +1\}$ with zero-point $z = 1$) is encoded in 3 bits. Two 4-bit nibble streams are interleaved to form 32-bit words aligned for DP4A:

\[\text{word}_i = \sum_{j=0}^{7} q_{8i+j} \ll 4j \quad \text{(for even } i\text{)}\]

The high bit of each 4-bit nibble encodes the interleave selector, enabling reconstruction via a single 32-bit load and bitfield extraction, maximizing L1 cache utilization.
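A simplified round-trip sketch of the nibble packing (ours; it ignores the interleave-selector bit and stores one 3-bit code per 4-bit nibble):

```python
def pack_word(codes):
    """Pack eight codes q in {0, 1, 2} into one 32-bit word, one nibble each."""
    assert len(codes) == 8 and all(0 <= q <= 2 for q in codes)
    word = 0
    for j, q in enumerate(codes):
        word |= q << (4 * j)
    return word

def unpack_word(word):
    """Recover the eight codes with one shift and a 4-bit mask each."""
    return [(word >> (4 * j)) & 0xF for j in range(8)]

codes = [2, 0, 1, 1, 2, 0, 0, 1]
assert unpack_word(pack_word(codes)) == codes
```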

CUDA Kernel Design (TurboQuant)

The core contribution is fusing the 256-point Inverse FWHT into the shared-memory loading stage of the MMQ kernel, so dequantized weights are never materialized in global memory:

  1. Load: Fetch interleaved 3-bit quants from global memory into registers
  2. Unpack: Bitfield-extract ternary values $\tilde{q}_j \in \{-1, 0, 1\}$ per thread
  3. Dequantize: $v_j \leftarrow d_k \cdot (\tilde{q}_j - z_k)$
  4. Write $v_j$ to shared memory smem_fwht[j]
  5. Synchronize: __syncthreads()
  6. 8 butterfly stages for step ∈ {1, 2, 4, …, 128}:
    • base ← (j ÷ step) · (2·step) + (j mod step); hi ← base + step
    • u ← smem_fwht[base]; v ← smem_fwht[hi]
    • smem_fwht[base] ← u + v; smem_fwht[hi] ← u − v
    • Synchronize after each stage
  7. Normalize: multiply by $1/\sqrt{256} = 0.0625$
  8. Proceed to matrix multiplication using shared memory as weight tile
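The butterfly stages can be emulated on the host to validate the index math (a sketch assuming 128 logical "threads" per 256-element tile; names ours):

```python
import numpy as np

def ifwht_kernel_emulation(tile: np.ndarray) -> np.ndarray:
    """Emulate the fused 256-point IFWHT: 8 butterfly stages over 128 'threads'."""
    smem = tile.astype(np.float64).copy()
    for step in (1, 2, 4, 8, 16, 32, 64, 128):
        nxt = smem.copy()                 # stand-in for the __syncthreads() barrier
        for j in range(128):              # one butterfly per logical thread
            base = (j // step) * 2 * step + (j % step)
            hi = base + step
            u, v = smem[base], smem[hi]
            nxt[base], nxt[hi] = u + v, u - v
        smem = nxt
    return smem * 0.0625                  # normalize by 1/sqrt(256)

# Cross-check against the dense normalized Hadamard matrix.
H = np.array([[1.0]])
while H.shape[0] < 256:
    H = np.block([[H, H], [H, -H]])
x = np.random.default_rng(2).normal(size=256)
assert np.allclose(ifwht_kernel_emulation(x), (H / 16.0) @ x)
```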

The normalization factor is the only arithmetic overhead over standard IQ3_S dequantization (2.1% overhead measured).

For the MMVQ path (autoregressive generation, batch size $B=1$), warp-level 32-point FWHT approximations use __shfl_xor_sync instructions, with full 256-point fidelity available via cooperative thread groups when shared memory allows.

Hardware Target

NVIDIA RTX 5090 (Blackwell, SM 120): 192 KB shared memory per SM, 1,024 threads per block, 4096 INT8 MACs/clock/SM via DP4A, 32 GiB GDDR7 at 1792 GB/s bandwidth. Blocks are aligned to 128-byte cache lines (100 bytes < 128 bytes) so that no weight block straddles a cache-line boundary.

Training

ITQ3_S is a post-training quantization (PTQ) method — no training or fine-tuning is required. The offline quantization pipeline operates as follows:

  1. For each block $w \in \mathbb{R}^{256}$ in the weight tensor $W \in \mathbb{R}^{M \times N}$:
    • Apply FWHT: $w' \leftarrow \text{FWHT}(w)$
    • Compute optimal scale: $d_k \leftarrow \alpha^*(\sigma(w'))$
    • Compute zero-point: $z_k \leftarrow -\text{round}(\mu(w') / d_k)$
    • Quantize: $q \leftarrow \text{Clamp}(\text{round}(w' / d_k) + z_k, -1, 1)$
    • Store: $\text{Pack}_{3b}(q), d_k, z_k$

No optimizer, learning rate, or training loop. The entire process is deterministic and runs in a single forward pass over the weight tensor.
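The whole offline pass for one block can be sketched in NumPy under the same conventions (helper names ours, packing omitted):

```python
import numpy as np

def fwht256(v):
    """Normalized 256-point FWHT (self-inverse)."""
    out = v.astype(np.float64).copy()
    step = 1
    while step < 256:
        for base in range(0, 256, 2 * step):
            for j in range(base, base + step):
                u, w = out[j], out[j + step]
                out[j], out[j + step] = u + w, u - w
        step *= 2
    return out / 16.0

def encode_block(w):
    wt = fwht256(w)                               # 1. rotate
    d = 0.798 * wt.std()                          # 2. scale from empirical sigma
    z = -np.round(wt.mean() / d)                  # 3. zero-point
    q = np.clip(np.round(wt / d) + z, -1, 1)      # 4. ternary quantize
    return q, d, z

def decode_block(q, d, z):
    return fwht256(d * (q - z))                   # dequantize, then inverse rotate

w = np.random.default_rng(3).normal(0.0, 0.02, size=256)
q, d, z = encode_block(w)
w_hat = decode_block(q, d, z)
assert np.mean((w_hat - w) ** 2) < w.var()        # reconstruction beats zero predictor
```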

Evaluation

Experimental Setup

  • Hardware: NVIDIA RTX 5090 (Blackwell SM 120, 32 GiB GDDR7, 1792 GB/s)
  • Models: LLaMA-3 8B, LLaMA-3 70B (sharded), Mistral 7B v0.3, Qwen2.5 32B
  • Baselines: FP16, Q8_0, Q4_K_M (GGUF), IQ3_S (llama.cpp), IQ4_XS, QuIP#-3bit
  • Metrics: WikiText-2 perplexity, C4 perplexity, tokens/sec (prefill and decode), memory footprint

Perplexity Results (WikiText-2, LLaMA-3 8B)

| Method | Bits/Weight | PPL ↓ | ΔPPL vs. FP16 | Mem (GiB) |
|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 6.14 | – | 15.0 |
| Q8_0 | 8.0 | 6.16 | +0.02 | 7.5 |
| Q4_K_M | 4.5 | 6.35 | +0.21 | 4.8 |
| IQ4_XS | 4.3 | 6.41 | +0.27 | 4.1 |
| IQ3_S (baseline 3-bit) | 3.5 | 7.03 | +0.89 | 3.4 |
| QuIP#-3bit | 3.0 | 6.78 | +0.64 | 3.0 |
| ITQ3_S (ours) | 3.125 | 6.52 | +0.38 | 3.1 |

Key findings:

  • ITQ3_S reduces the perplexity gap to FP16 by 57% compared to IQ3_S (0.38 vs. 0.89)
  • ITQ3_S outperforms QuIP#-3bit by 0.26 perplexity points
  • At 3.125 bits/weight, ITQ3_S comes within 0.17 PPL of Q4_K_M at 4.5 bits/weight (6.52 vs. 6.35) while using roughly 30% fewer bits

Throughput Results (RTX 5090, LLaMA-3 8B)

| Method | Decode (tok/s) | Prefill (tok/s) | Speedup vs. FP16 (decode / prefill) |
|---|---|---|---|
| FP16 | 480 | 28,400 | 1.0× / 1.0× |
| Q4_K_M | 890 | 42,100 | 1.9× / 1.5× |
| IQ3_S | 1,020 | 47,800 | 2.1× / 1.7× |
| ITQ3_S | 960 | 51,200 | 2.0× / 1.8× |

The IFWHT overhead reduces decode throughput slightly vs. IQ3_S (960 vs. 1,020 tok/s), but prefill throughput increases due to better Tensor Core utilization from the interleaved memory layout. ITQ3_S delivers roughly 1.2× the prefill throughput of Q4_K_M (51,200 vs. 42,100 tok/s).

FWHT Block Size Ablation

| Block Size | PPL ↓ | Overhead (%) |
|---|---|---|
| 32 | 6.81 | 0.3 |
| 64 | 6.67 | 0.7 |
| 128 | 6.59 | 1.4 |
| 256 (ITQ3_S) | 6.52 | 2.1 |
| 512 | 6.51 | 4.8 |

$n = 256$ achieves the best quality-efficiency tradeoff. Going to $n = 512$ improves PPL by only 0.01 but increases IFWHT overhead by 2.3×.

70B Model Scaling

LLaMA-3 70B at 3.125 bits/weight requires ≈ 27.3 GiB, fitting within the RTX 5090’s 32 GiB with 4.7 GiB to spare for KV cache at ~16K context tokens. The paper claims this is the first demonstration of a 70B-class model running at full single-GPU throughput on consumer hardware without model sharding.

Reproduction Guide

No public code repository was available at the time of publication. The following steps outline the expected reproduction path based on the paper’s description:

1. Obtain and Build llama.cpp with ITQ3_S Support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Apply ITQ3_S kernel patches (pending upstream merge)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

2. Quantize a Model to ITQ3_S

# Download model (e.g., LLaMA-3 8B)
# Quantize with ITQ3_S format
./build/bin/llama-quantize \
    models/Meta-Llama-3-8B-FP16.gguf \
    models/Meta-Llama-3-8B-ITQ3S.gguf \
    ITQ3_S

The quantization pipeline per block:

  1. Load 256 weights
  2. Apply 256-point FWHT (deterministic, no random seed)
  3. Compute block scale $d_k = 0.798 \cdot \sigma(w')$
  4. Compute zero-point $z_k = -\text{round}(\mu(w') / d_k)$
  5. Round and clamp values to $\{-1, 0, +1\}$
  6. Pack into interleaved 3-bit format

3. Verify Perplexity

./build/bin/llama-perplexity \
    -m models/Meta-Llama-3-8B-ITQ3S.gguf \
    -f data/wikitext-2-test.txt \
    --log-disable

# Expected: PPL ≈ 6.52 on WikiText-2
# Compare against:
#   FP16 baseline: PPL ≈ 6.14
#   IQ3_S baseline: PPL ≈ 7.03

4. Benchmark Throughput

./build/bin/llama-bench \
    -m models/Meta-Llama-3-8B-ITQ3S.gguf \
    -t 1 -p 512 -n 128 \
    --log-disable

# Expected decode: ~960 tok/s, prefill: ~51,200 tok/s (RTX 5090)

5. Verify 70B Fits in 32 GiB

./build/bin/llama-server \
    -m models/Meta-Llama-3-70B-ITQ3S.gguf \
    -c 16384 -ngl 99 \
    --log-disable

# Expected memory: ~27.3 GiB for weights + ~4.7 GiB for KV cache

Notes

Key Takeaways

  1. FWHT rotation is the secret sauce. By pre-rotating weights via the deterministic 256-point Walsh-Hadamard Transform, ITQ3_S smooths heavy-tailed distributions into near-Gaussian form, making ternary quantization far more effective. The key theoretical result: the rotation is isometric ($|Hv|_2 = |v|_2$), so it does not increase error — it only reduces the optimal quantization scale.

  2. Deterministic > random for practical deployment. Unlike QuIP# which uses random orthogonal rotations requiring seed storage and runtime reconstruction, ITQ3_S applies the same fixed $H_{256}$ to every block. For $n \leq 256$, the theoretical difference is negligible, but the practical benefit is enormous — complete kernel fusion with no extra memory traffic.

  3. 3.125 bits beats 4.5 bits in quality-per-bit. ITQ3_S at 3.125 bits/weight achieves PPL 6.52 vs. Q4_K_M’s 6.35, a gap of only 0.17 while using 30% less memory. This is significant for edge deployment where every bit matters.

  4. 70B on a single consumer GPU is now feasible. At 27.3 GiB for weights with 4.7 GiB KV cache headroom, this opens the door to running frontier-class models without multi-GPU setups or cloud infrastructure.

Connections to Other Work

  • QuIP# (Tseng et al., ICML 2024): ITQ3_S adapts the core insight of rotation-based incoherence from QuIP# but trades random Kronecker-product rotations for a deterministic, hardware-matched FWHT. The paper explicitly addresses the “domain mismatch” problem that arises when naively composing QuIP#’s rotation with existing weight quantizers.

  • IQ3_S (llama.cpp/ggml): ITQ3_S builds directly on the IQ3_S ternary format, extending it with pre-rotation. The memory layout and packing scheme are designed to be a drop-in extension.

  • GPTQ/AWQ (post-training quantization): ITQ3_S avoids second-order Hessian-based methods entirely. The FWHT smoothing eliminates the need for activation-aware scaling because the outlier problem is addressed at the distribution level rather than through per-channel correction.

  • TurboQuant (TQ): The theoretical FWHT-based rotation strategy. ITQ3_S is the practical realization of TQ, providing the CUDA kernel implementation that TQ lacked.

  • Walsh-Hadamard Transform (Hadamard, 1893): The mathematical foundation dates back to the 19th century. Its self-inverse property ($H^2 = I$) and $O(n \log n)$ computational complexity make it uniquely suited for quantization-dequantization pipelines.

Limitations

  • Weights only: No activation quantization; combining with 8-bit activation quantization could further reduce bandwidth.
  • Post-training only: No QAT integration; training-aware quantization could recover additional accuracy.
  • Power-of-two block size constraint: Hidden dimensions not divisible by 256 require padding strategies whose impact needs further study.
  • Single-author paper with no public code: Reproducibility depends on community adoption into llama.cpp.
  • No multi-model comprehensive benchmarks: Results are shown for LLaMA-3 8B in detail; 70B, Mistral 7B, and Qwen2.5 32B are mentioned but detailed numbers are not provided.