2026-03-30
SliderQuant: Accurate Post-Training Quantization for LLMs via Sliding-Layer Window Design
Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao et al.
problem
existing PTQ methods (GPTQ, AWQ, SmoothQuant, OmniQuant, CBQ) treat all layers equally in their sequential quantization framework. empirical analysis shows this is suboptimal: shallow and deep layers (especially the first and last) are significantly more sensitive to quantization than intermediate layers. under challenging low-bit settings (W4A4, W2A16), this equal treatment leads to larger accumulated errors. SliderQuant adapts the quantization process to each layer’s sensitivity via an adaptive sliding window framework.
architecture
two components:
inter-layer sliding quantization. three window designs for different layer regions:
- shallow layers (first 4): progressively expanded sliding window (PESW), starting from size 1 and growing by 1 per step. the first layer is always included, building dense local-to-global synergies
- intermediate layers: fixed-size sliding window (size $s=2$, stride $i=1$), so each intermediate layer is optimized with even frequency
- deep layers (last 4): progressively contracted sliding window (PCSW), shrinking from all deep layers down to just the last layer
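the three window shapes above can be sketched as follows. this is my reconstruction from the description, not code from the repo; the function name and the toy 12-layer depth are hypothetical:

```python
def sliding_windows(num_layers, ls=4, ld=4, s=2):
    """Sketch of the inter-layer window schedule: PESW over the first
    `ls` layers, fixed-size windows (size s, stride 1) over intermediate
    layers, PCSW over the last `ld` layers."""
    windows = []
    # shallow region: progressively expanded sliding window (PESW);
    # the first layer is always included, and the window grows by 1 per step
    for end in range(1, ls + 1):
        windows.append(list(range(0, end)))
    # intermediate region: fixed-size window of size s, stride 1
    for start in range(ls, num_layers - ld - s + 1):
        windows.append(list(range(start, start + s)))
    # deep region: progressively contracted sliding window (PCSW);
    # shrinks from all deep layers down to just the last layer
    for start in range(num_layers - ld, num_layers):
        windows.append(list(range(start, num_layers)))
    return windows

# e.g. for a 12-layer toy model:
# [0], [0,1], [0,1,2], [0,1,2,3],            <- PESW
# [4,5], [5,6], [6,7],                       <- fixed s=2
# [8,9,10,11], [9,10,11], [10,11], [11]      <- PCSW
```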
intra-layer sliding quantization. within each window, applies progressively expanded sliding along weight and activation dimensions with ratio $\gamma = 0.5$, completing quantization in 2 stages (first half, then full). builds local-to-global parameter synergy.
quantizer: combines channel scaling (CS) with low-rank adaptation (LoRA, rank r=4):
\[\tilde{X}_i = X_i \oslash \alpha_i, \quad \tilde{W}_i = W_i \odot \alpha_i + A_i B_i\]
where $\alpha_i$ is a learnable channel-wise scaling vector, $\oslash$/$\odot$ denote element-wise division/multiplication, and $A_i B_i$ is the low-rank correction. uniform quantizer for both weights and activations. calibration samples: 128.
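a minimal numpy sketch of the CS+LoRA reparameterization and a generic symmetric uniform fake-quantizer. function names are mine, and the quantizer details beyond "uniform" are assumptions; with $A, B$ initialized to zero, the transform is an exact reparameterization of $XW$:

```python
import numpy as np

def uniform_fake_quant(x, bits=4):
    # generic symmetric per-tensor uniform quantizer (an assumption;
    # the notes only say "uniform quantizer")
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def cs_lora_transform(X, W, alpha, A, B):
    """X: (tokens, d_in), W: (d_in, d_out), alpha: (d_in,),
    A: (d_in, r), B: (r, d_out), with r=4 per the notes.
    Mirrors X~ = X ./ alpha, W~ = W .* alpha + A B, so that
    X~ @ W~ = X @ W + (X ./ alpha) @ A B."""
    X_t = X / alpha                     # channel-wise descaling of activations
    W_t = W * alpha[:, None] + A @ B    # absorb scales into weights, add LoRA term
    return X_t, W_t
```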
training
- post-training only, no retraining required
- 128 calibration samples from WikiText2
- LoRA rank: 4
- layer boundary defaults: $L_s = 4$ shallow, $L_d = 4$ deep
- combines with OmniQuant-style learnable parameters per layer
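the defaults above, collected as one config sketch for reference (key names are mine, not from the repo):

```python
# hypothetical config mirroring the defaults listed above;
# key names are illustrative and do not come from the SliderQuant repo
SLIDERQUANT_DEFAULTS = {
    "calib_dataset": "wikitext2",
    "calib_samples": 128,
    "lora_rank": 4,                        # r
    "shallow_layers": 4,                   # L_s (PESW region)
    "deep_layers": 4,                      # L_d (PCSW region)
    "intermediate_window": {"size": 2, "stride": 1},
    "intra_layer_ratio": 0.5,              # gamma: 2 stages (first half, then full)
}
```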
evaluation
W4A16 weight-only (perplexity, lower is better):
| model | GPTQ | AWQ | OmniQuant | CBQ | SliderQuant |
|---|---|---|---|---|---|
| Llama2-7B (Wiki) | 5.83 | 5.74 | 5.67 | 5.61 | 5.61 |
| Llama2-70B (Wiki) | 5.20 | 5.02 | 5.02 | 5.02 | 5.00 |
| Qwen2.5-14B (Wiki) | 6.45 | 6.54 | 5.94 | 5.83 | 5.80 |
W4A4 weight-activation (perplexity):
| model | SmoothQuant | OmniQuant | CBQ | SliderQuant |
|---|---|---|---|---|
| Llama2-7B (Wiki) | 7.85 | 6.79 | 6.56 | 6.43 |
| Llama2-13B (Wiki) | 8.49 | 7.28 | 6.93 | 6.54 |
W2A16 extreme quantization: SliderQuant achieves 9.59 ppl on Llama2-7B vs 55.0 (RTN), 12.1 (OmniQuant), 37.37 (AWQ). dramatic improvement in the 2-bit regime.
DeepSeek-R1 distilled models: near-lossless 4-bit weight-only quantization on MATH-500, AIME2024, GSM8K, HumanEval+, MBPP+.
MoE (Qwen3-30B-A3B): effective quantization with the sliding window approach.
reproduction guide
- clone https://github.com/deep-optimization/SliderQuant
- prepare calibration data (128 samples from WikiText2)
- run quantization with default settings: Ls=4, Ld=4, LoRA rank=4
- W2A16 extreme quantization is where the gap over baselines is largest (see the 2-bit numbers above), so it's the setting to try first
- expected: consistent perplexity improvements over GPTQ/AWQ/OmniQuant/CBQ
- gotchas: the sliding window adds quantization time proportional to the window overlap, but it's still PTQ (no full-model retraining). memory cost scales with window size; per the window definitions above, the largest windows are the shallow/deep regions, so memory is bounded by $L_s$ and $L_d$ layers
notes
- the empirical finding that first/last layers are most sensitive to quantization is well-known in CV (first conv, last FC) but hadn’t been systematically exploited for LLM PTQ before
- the 2-bit results are remarkable: 9.59 ppl vs 37.37 for AWQ and 12.1 for OmniQuant on Llama2-7B. this makes 2-bit deployment actually viable
- for bopi’s interest in LLMs on tiny devices: SliderQuant’s W2A16 and W4A4 results directly improve deployability on edge hardware. the method is PTQ only (no retraining needed), making it very practical
- works on MoE models (Qwen3-30B-A3B) and reasoning models (DeepSeek-R1 distilled), which is important for keeping reasoning-capable models small enough for edge deployment