2026-03-30
GlowQ: Group-Shared Low-Rank Approximation for Quantized LLMs
Ilhong Suh, Selim An, Yeseong Kim
problem
low-rank correction methods (LQER, QERA, ASER) recover quantization accuracy with a per-layer correction $W \approx W_q + AB$, but they compute $BX$ independently for every module, even when modules share the same input (e.g., the Q/K/V projections all ingest the same hidden state). this duplicates expensive high-precision matmuls, inflates memory traffic, and adds inference latency. GlowQ instead computes one shared right-factor product $B_{\text{shared}} X$ per input-sharing group and reuses it across that group's modules.
architecture
group-shared right factor. modules that share the same input form a group (e.g., Q, K, V in attention; gate and up projections in the MLP, while the down projection ingests the intermediate activation and so gets its own correction). GlowQ learns one shared $B_{\text{shared}}$ per group and module-specific left factors $\{A_i\}$. at inference, compute $R = B_{\text{shared}} X$ once per group, then each module applies $A_i R$.
covariance-aligned objective. a usage-weighted risk $\min_{A,B} \|(E_{\text{cat}} - AB)\,\Sigma^{1/2}\|_F^2$, with $\Sigma$ the input activation covariance, aligns the shared subspace with data-preferred directions (accounting for anisotropic activations). solved via QR-reduced randomized SVD: thin QR compresses the stacked error into a $d \times d$ core, then randomized SVD with oversampling and power iterations extracts the dominant right singular vectors.
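a numpy sketch of the covariance-aligned, QR-reduced randomized SVD. the right-side placement of $\Sigma^{1/2}$, the eigendecomposition square root, and all names here are my assumptions, not the paper's exact recipe:

```python
import numpy as np

def psd_sqrt(S, eps=1e-8):
    """return (S^{1/2}, S^{-1/2}) for a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, eps, None)
    return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

def glowq_factor(E_stack, Sigma, rank, oversample=8, power_iters=2, seed=0):
    """factor a stacked error E_stack (m, d) as A @ B under the Sigma-weighted norm.

    Sigma is the (d, d) input activation covariance. returns A (m, rank) and
    B (rank, d) so that E_stack ~ A @ B, with rank allocated to directions the
    data actually uses.
    """
    S_half, S_inv_half = psd_sqrt(Sigma)
    Ew = E_stack @ S_half                      # whiten on the input side
    Q, core = np.linalg.qr(Ew)                 # thin QR: tall (m, d) -> (d, d) core
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, core.shape[1])
    Y = core @ rng.standard_normal((core.shape[1], k))
    for _ in range(power_iters):               # power iterations sharpen spectral decay
        Y = core @ (core.T @ Y)
    Qy, _ = np.linalg.qr(Y)
    U_small, s, Vt = np.linalg.svd(Qy.T @ core, full_matrices=False)
    U = Q @ (Qy @ U_small[:, :rank])
    A = U * s[:rank]                           # stacked left factors (split per module by rows)
    B = Vt[:rank] @ S_inv_half                 # un-whiten the shared right factor
    return A, B
```

with $\Sigma = I$ this degenerates to plain randomized SVD of the stacked error; the whitening only matters when activations are anisotropic.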
selective restore (GlowQ-S). not all layers benefit equally from correction. GlowQ-S activates only high-payoff groups using: (1) SVD energy-capture score $\|A_g\|_F^2$ per group, (2) normalized error ratio $\|E_g\|_F / \|W_g\|_F$, (3) layer-order fallback. this reduces the number of active correction modules, cutting latency.
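a minimal sketch of the selection step. my notes don't say how the two scores are combined or when the layer-order fallback kicks in, so the product weighting below is a placeholder:

```python
import numpy as np

def select_groups(groups, top_k):
    """rank correction groups by payoff, keep only the top_k.

    `groups` maps name -> (A, E, W): the group's left factor, stacked
    quantization error, and stacked original weight. scoring combines the
    energy captured by A with the relative error size; the simple product
    here is an assumption, not the paper's formula.
    """
    scores = {}
    for name, (A, E, W) in groups.items():
        energy = np.linalg.norm(A) ** 2                     # ||A_g||_F^2
        err_ratio = np.linalg.norm(E) / np.linalg.norm(W)   # ||E_g||_F / ||W_g||_F
        scores[name] = energy * err_ratio
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```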
inference path: quantized weights $W_q$ produce the base output. the cached $R = B_{\text{shared}} X$ is computed once per group and each module adds its $A_i R$ correction.
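the inference path in a few lines (shapes and naming are mine):

```python
import numpy as np

def group_forward(X, Wq_list, A_list, B_shared):
    """forward pass for one input-sharing group (e.g., Q/K/V).

    X is (d_in, n_tokens). each module i computes Wq_i @ X + A_i @ R, where
    R = B_shared @ X is computed once and reused across the whole group.
    """
    R = B_shared @ X                                  # single (r, n) high-precision matmul
    return [Wq @ X + A @ R for Wq, A in zip(Wq_list, A_list)]
```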
training
- post-training, no gradient training of the base model
- calibration data to compute quantization error matrices $E_i = W_i - W_{q,i}$
- QR-reduced randomized SVD with oversampling and power iterations for numerical stability
- rank $r$ is a hyperparameter (tunable per deployment budget)
- tested on Llama2-7B, Llama2-70B, Mistral-7B, Qwen2.5-7B
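the calibration step just needs dequantized weights to form $E_i = W_i - W_{q,i}$. a toy round-to-nearest quantizer makes the shape of the computation concrete (stand-in only; in the real pipeline $W_q$ comes from the deployed quantizer):

```python
import numpy as np

def quantization_error(W, n_bits=4):
    """E = W - W_q under a toy symmetric round-to-nearest quantizer with a
    per-row scale. in practice W_q is the dequantized output of GPTQ/AWQ/BnB,
    not this sketch."""
    qmax = 2 ** (n_bits - 1) - 1                              # 7 for 4-bit symmetric
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-12) / qmax
    Wq = np.round(W / scale).clip(-qmax - 1, qmax) * scale
    return W - Wq, Wq
```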
evaluation
efficiency improvements over per-layer low-rank baselines (LQER, QERA, ASER):
- GlowQ: TTFB reduced by 5.6%, throughput increased by 9.6% on average
- GlowQ-S (selective): TTFB reduced by 23.4%, throughput increased by 37.4%
- accuracy maintained within 0.2 percentage points of full GlowQ
accuracy improvements over 4-bit baselines:
- perplexity on WikiText-2 reduced by 0.17%
- downstream task accuracy increased by 0.42 percentage points
key advantage: GlowQ computes one BX per group instead of per module. for a standard transformer block this means 1 BX for the QKV group (instead of 3) and 1 BX for the gate/up group (instead of 2); with the down projection corrected on its own, the six high-precision right-factor matmuls per block drop to three, cutting them roughly in half.
reproduction guide
- start from a 4-bit quantized model (GPTQ/AWQ/BnB)
- compute per-layer quantization error matrices $E_i = W_i - W_{q,i}$ using calibration data
- group modules by shared input (QKV as one group, the MLP gate/up projections as another)
- stack each group's errors, apply QR-reduced randomized SVD to get the shared $B_{\text{shared}}$ and per-module $\{A_i\}$
- for selective variant (GlowQ-S), rank groups by energy-capture score and activate only top-k
- expected: reduced TTFB and increased throughput with maintained accuracy
- gotchas: the selective variant’s performance depends on choosing the right saliency metric. QR reduction is needed for numerical stability when error matrices are tall
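the guide above, condensed into one routine for a single group. plain truncated SVD stands in for the covariance-aligned, QR-reduced randomized variant, and all names are mine:

```python
import numpy as np

def build_glowq_group(weights, quantized, rank):
    """build factors for one input-sharing group.

    weights / quantized: lists of per-module (d_out_i, d_in) matrices. stacks
    the errors E_i = W_i - W_q,i, takes a truncated SVD, and splits the left
    factor back into per-module row blocks.
    """
    errors = [W - Wq for W, Wq in zip(weights, quantized)]
    E = np.vstack(errors)                        # stacked group error (sum d_out_i, d_in)
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * s[:rank]                   # stacked left factors
    B_shared = Vt[:rank]                         # one shared right factor per group
    rows = np.cumsum([e.shape[0] for e in errors])[:-1]
    A_list = np.split(A, rows, axis=0)           # A_i for each module
    return A_list, B_shared
```

at full rank this reconstructs each $W_i$ exactly as $W_{q,i} + A_i B_{\text{shared}}$; the deployment knob is how far below full rank you can go.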
notes
- the input-sharing group insight is simple but effective: Q/K/V all see the same hidden state, so why compute BX three times? this is an obvious optimization in hindsight but prior work (LQER, QERA, ASER) all missed it
- GlowQ-S’s 37.4% throughput improvement while losing only 0.2 pp accuracy is a strong tradeoff for edge deployment where latency matters more than marginal accuracy gains
- for bopi’s edge LLM interests: this directly reduces inference latency of quantized models by eliminating redundant matmuls. stacks with any PTQ method (GPTQ, AWQ, SliderQuant) as a post-processing correction step
- the covariance alignment is important: activations are highly anisotropic, so naive SVD on raw errors misallocates rank to rarely-used directions