2026-03-30
SliderQuant: Accurate Post-Training Quantization for LLMs via Sliding-Layer Window Design
Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao et al.
problem
existing PTQ methods (GPTQ, AWQ, SmoothQuant, OmniQuant, CBQ) treat all layers equally in their sequential quantization framework. empirical analysis shows this is suboptimal: shallow and deep layers (especially the first and last) are significantly more sensitive to quantization than intermediate layers. under challenging low-bit settings (W4A4, W2A16), this equal treatment leads to larger accumulated errors. SliderQuant adapts the quantization process to each layer’s sensitivity via an adaptive sliding window framework.
architecture
two components:
inter-layer sliding quantization. three window designs for different layer regions:
- shallow layers (first 4): progressively expanded sliding window (PESW), starting from size 1 and growing by 1 per step. the first layer is always included, building dense local-to-global synergies
- intermediate layers: fixed-size sliding window (size $s=2$, stride $i=1$), so each intermediate layer is optimized with even frequency
- deep layers (last 4): progressively contracted sliding window (PCSW), shrinking from all deep layers down to just the last layer
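the three window shapes above can be sketched as follows. this is my reconstruction from the description, not code from the repo; the function name and the toy 12-layer depth are hypothetical:

```python
def sliding_windows(num_layers, ls=4, ld=4, s=2):
    """Sketch of the inter-layer window schedule: PESW over the first
    `ls` layers, fixed-size windows (size s, stride 1) over intermediate
    layers, PCSW over the last `ld` layers."""
    windows = []
    # shallow region: progressively expanded sliding window (PESW);
    # the first layer is always included, and the window grows by 1 per step
    for end in range(1, ls + 1):
        windows.append(list(range(0, end)))
    # intermediate region: fixed-size window of size s, stride 1
    for start in range(ls, num_layers - ld - s + 1):
        windows.append(list(range(start, start + s)))
    # deep region: progressively contracted sliding window (PCSW);
    # shrinks from all deep layers down to just the last layer
    for start in range(num_layers - ld, num_layers):
        windows.append(list(range(start, num_layers)))
    return windows

# e.g. for a 12-layer toy model:
# [0], [0,1], [0,1,2], [0,1,2,3],            <- PESW
# [4,5], [5,6], [6,7],                       <- fixed s=2
# [8,9,10,11], [9,10,11], [10,11], [11]      <- PCSW
```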
intra-layer sliding quantization. within each window, applies progressively expanded sliding along weight and activation dimensions with ratio $\gamma = 0.5$, completing quantization in 2 stages (first half, then full). builds local-to-global parameter synergy.
quantizer: combines channel scaling (CS) with low-rank adaptation (LoRA, rank r=4):
\[\tilde{X}_i = X_i \oslash \alpha_i, \quad \tilde{W}_i = W_i \odot \alpha_i + A_i B_i\]
where $\alpha_i$ is a learnable channel-wise scaling vector, $\oslash$/$\odot$ denote element-wise division/multiplication, and $A_i B_i$ is the low-rank correction. uniform quantizer for both weights and activations. calibration samples: 128.
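a minimal numpy sketch of the CS+LoRA reparameterization and a generic symmetric uniform fake-quantizer. function names are mine, and the quantizer details beyond "uniform" are assumptions; with $A, B$ initialized to zero, the transform is an exact reparameterization of $XW$:

```python
import numpy as np

def uniform_fake_quant(x, bits=4):
    # generic symmetric per-tensor uniform quantizer (an assumption;
    # the notes only say "uniform quantizer")
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def cs_lora_transform(X, W, alpha, A, B):
    """X: (tokens, d_in), W: (d_in, d_out), alpha: (d_in,),
    A: (d_in, r), B: (r, d_out), with r=4 per the notes.
    Mirrors X~ = X ./ alpha, W~ = W .* alpha + A B, so that
    X~ @ W~ = X @ W + (X ./ alpha) @ A B."""
    X_t = X / alpha                     # channel-wise descaling of activations
    W_t = W * alpha[:, None] + A @ B    # absorb scales into weights, add LoRA term
    return X_t, W_t
```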
training
- post-training only, no retraining required
- 128 calibration samples from WikiText2
- LoRA rank: 4
- layer boundary defaults: $L_s = 4$ shallow, $L_d = 4$ deep
- combines with OmniQuant-style learnable parameters per layer
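the defaults above, collected as one config sketch for reference (key names are mine, not from the repo):

```python
# hypothetical config mirroring the defaults listed above;
# key names are illustrative and do not come from the SliderQuant repo
SLIDERQUANT_DEFAULTS = {
    "calib_dataset": "wikitext2",
    "calib_samples": 128,
    "lora_rank": 4,                        # r
    "shallow_layers": 4,                   # L_s (PESW region)
    "deep_layers": 4,                      # L_d (PCSW region)
    "intermediate_window": {"size": 2, "stride": 1},
    "intra_layer_ratio": 0.5,              # gamma: 2 stages (first half, then full)
}
```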
evaluation
W4A16 weight-only (perplexity, lower is better):
| model | GPTQ | AWQ | OmniQuant | CBQ | SliderQuant |
|---|---|---|---|---|---|
| Llama2-7B (Wiki) | 5.83 | 5.74 | 5.67 | 5.61 | 5.61 |
| Llama2-70B (Wiki) | 5.20 | 5.02 | 5.02 | 5.02 | 5.00 |
| Qwen2.5-14B (Wiki) | 6.45 | 6.54 | 5.94 | 5.83 | 5.80 |
W4A4 weight-activation (perplexity):
| model | SmoothQuant | OmniQuant | CBQ | SliderQuant |
|---|---|---|---|---|
| Llama2-7B (Wiki) | 7.85 | 6.79 | 6.56 | 6.43 |
| Llama2-13B (Wiki) | 8.49 | 7.28 | 6.93 | 6.54 |
W2A16 extreme quantization: SliderQuant achieves 9.59 ppl on Llama2-7B vs 55.0 (RTN), 12.1 (OmniQuant), 37.37 (AWQ). dramatic improvement in the 2-bit regime.
DeepSeek-R1 distilled models: near-lossless 4-bit weight-only quantization on MATH-500, AIME2024, GSM8K, HumanEval+, MBPP+.
MoE (Qwen3-30B-A3B): effective quantization with the sliding window approach.
reproduction guide
- clone https://github.com/deep-optimization/SliderQuant
- prepare calibration data (128 samples from WikiText2)
- run quantization with default settings: Ls=4, Ld=4, LoRA rank=4
- W2A16 extreme quantization is where the gap over baselines is largest (see the 2-bit numbers above), so it's the setting to try first
- expected: consistent perplexity improvements over GPTQ/AWQ/OmniQuant/CBQ
- gotchas: the sliding window adds quantization time proportional to the window overlap, but it's still PTQ (no full-model retraining). memory cost scales with window size; per the window definitions above, the largest windows are the shallow/deep regions, so memory is bounded by $L_s$ and $L_d$ layers
notes
- the empirical finding that first/last layers are most sensitive to quantization is well-known in CV (first conv, last FC) but hadn’t been systematically exploited for LLM PTQ before
- the 2-bit results are remarkable: 9.59 ppl vs 37.37 for AWQ and 12.1 for OmniQuant on Llama2-7B. this makes 2-bit deployment actually viable
- for bopi’s interest in LLMs on tiny devices: SliderQuant’s W2A16 and W4A4 results directly improve deployability on edge hardware. the method is PTQ only (no retraining needed), making it very practical
- works on MoE models (Qwen3-30B-A3B) and reasoning models (DeepSeek-R1 distilled), which is important for keeping reasoning-capable models small enough for edge deployment