2026-04-02
Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge
Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati et al.
Problem
Deploying Large Vision Models (LVMs) for generative AI tasks (image editing, object removal, prompt-guided transformation) on resource-constrained edge devices (smartphones) is extremely challenging due to high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, the standard deployment approach is fundamentally flawed for multi-task scenarios:
Core problem: Existing mobile deployment pipelines compile separate model binaries for each LoRA plus a full copy of the foundation model. When deploying multiple GenAI use-cases, this leads to redundant storage, increased runtime overhead, and inability to switch tasks at runtime without recompilation or reloading entire model graphs.
Prior art and limitations:
- QLoRA (Dettmers et al., 2023): Improves memory efficiency during fine-tuning via quantization-aware techniques but assumes static model graphs; does not address multi-LoRA runtime switching on edge NPUs.
- QaLoRA (Xu et al., 2023): Quantization-aware low-rank adaptation for LLMs; not designed for vision diffusion models or NPU deployment.
- MobileDiffusion (Zhao et al., 2024): Reduces diffusion inference cost for mobile devices but builds a single-purpose model from scratch rather than adapting existing LVMs.
- ControlNet (Zhang et al., 2023): Introduces structured conditioning for controllable image editing but adds parallel network branches, increasing model size.
- UniVG (Fu et al., 2025), OneDiffusion (Le et al., 2025), Dual Diffusion (Li et al., 2025): Attempt to unify multiple tasks within a single generative framework via end-to-end multi-task training but do not support modular runtime adaptation or separate adapter switching.
- Multi-LoRA meets Vision (Kesim & Helli, 2024): Merges multiple adapters into a multi-task model but does not address quantization compatibility across LoRAs for edge deployment.
- Conv-Adapter (Chen et al., 2024), AdaptFormer (Chen et al., 2022): Parameter-efficient transfer learning methods but not designed for on-device deployment with dynamic switching.
Key gap: When each LoRA adapter is quantized independently, the resulting adapters require different quantization parameters (scale and zero-point), making them incompatible with a single static inference graph. This prevents efficient runtime task switching, increases memory overhead from multiple calibration states, and complicates NPU deployment where fixed quantization parameters are typically required.
Architecture
Overview
The QUAD (Quantization with Unified Adaptive Distillation) framework has three main components:
- LoRA-as-Input reformulation: Restructure the LVM so LoRA weights are runtime inputs rather than baked into the compiled graph.
- Unified Adaptive Distillation: Align all LoRA weight distributions to share a single quantization profile via knowledge distillation.
- Edge deployment stack: Graph optimization, conversion to hardware-specific IR, and lightweight runtime with dynamic LoRA loading.
Base Model: Latent Diffusion Backbone
The foundation model follows a Stable Diffusion 1.5 architecture. The forward pass is:
\[\hat{x} = D\left(U(z\_t, c)\right), \quad z\_t = E(x) + \epsilon\_t\]where $E$ is the VAE encoder, $U$ is the denoising U-Net backbone, $D$ is the VAE decoder, $x$ is the input image, $\epsilon_t$ is the noise at timestep $t$, $z_t$ is the noisy latent encoding, and $c$ is the conditioning (text prompt or image).
Two model sizes are used:
- 1.1B parameter U-Net (used for 2-use-case evaluation in Table 2)
- 0.7B parameter U-Net (used for 4-use-case evaluation in Table 3)
LoRA Augmentation
For each linear transformation $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ in the U-Net’s transformer and convolution blocks, a LoRA-augmented version is:
\[W\_{\text{LoRA}} = W + \alpha A B\]where $A \in \mathbb{R}^{d_{\text{out}} \times r}$, $B \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll \min(d_{\text{out}}, d_{\text{in}})$ is the rank, and $\alpha$ is the scaling factor. During multi-LoRA training, the base model parameters are frozen and separate pairs $(A_i, B_i)$ are trained for each task $i$.
LoRA-as-Input Reformulation
Instead of merging LoRA weights into the model graph, each LoRA-augmented layer is modified to expose additional input nodes for $A$ and $B$. The computation becomes:
\[y = Wx + \alpha A(Bx)\]where $W$ is the frozen weight binary, $x$ and $y$ are input and output feature maps, and $A$, $B$ are supplied as runtime inputs. The model is compiled once; at inference time, different tasks are supported by supplying corresponding LoRA weights on-the-fly.
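To make the reformulation concrete, here is a minimal pure-Python sketch (no framework; all shapes and values are illustrative) checking that supplying $A$ and $B$ as runtime inputs reproduces the merged-weight computation exactly:

```python
# Toy check: y = W x + alpha * A (B x) equals (W + alpha * A B) x.
# Matrices are plain lists of lists; shapes are illustrative only.

def matvec(M, v):
    return [sum(m * u for m, u in zip(row, v)) for row in M]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

W = [[1.0, 2.0], [3.0, 4.0]]          # frozen base weight (d_out x d_in)
A = [[0.5], [-1.0]]                   # LoRA up-projection (d_out x r), r = 1
B = [[2.0, 0.0]]                      # LoRA down-projection (r x d_in)
alpha = 0.1
x = [1.0, -1.0]

# Compiled-graph path: A, B arrive as runtime inputs.
lora = matvec(A, matvec(B, x))
y_runtime = [wx + alpha * l for wx, l in zip(matvec(W, x), lora)]

# Reference path: LoRA merged into the weight offline.
AB = matmul(A, B)
W_merged = [[W[i][j] + alpha * AB[i][j] for j in range(2)] for i in range(2)]
y_merged = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(y_runtime, y_merged))
```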
Unified Quantization Strategy
The quantization setup follows standard affine quantization. For a tensor $T$:
\[s = \frac{T\_{\max} - T\_{\min}}{q\_{\max} - q\_{\min}}\] \[z = q\_{\min} - \left\lfloor \frac{T\_{\min}}{s} \right\rceil\] \[\hat{T} = \text{clip}\left(\left\lfloor \frac{T}{s} \right\rceil + z,\; q\_{\min},\; q\_{\max}\right)\] \[T \approx s \cdot (\hat{T} - z)\]where $\lfloor \cdot \rceil$ denotes round-to-nearest. For signed INT8: $q_{\max} = 127$, $q_{\min} = -128$.
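A minimal sketch of this affine scheme in pure Python (tensor values are illustrative, not from the paper), checking that the round-trip error stays within half a quantization step:

```python
# Affine INT8 quantization of a tensor (here a flat list):
# calibrate scale/zero-point from the tensor range, then quantize/dequantize.
Q_MIN, Q_MAX = -128, 127

def calibrate(t):
    t_min, t_max = min(t), max(t)
    s = (t_max - t_min) / (Q_MAX - Q_MIN)   # scale from tensor range
    z = Q_MIN - round(t_min / s)            # zero-point maps t_min -> Q_MIN
    return s, z

def quantize(t, s, z):
    return [max(Q_MIN, min(Q_MAX, round(v / s) + z)) for v in t]

def dequantize(q, s, z):
    return [s * (v - z) for v in q]

T = [-1.0, -0.25, 0.0, 0.5, 1.5]
s, z = calibrate(T)
T_hat = dequantize(quantize(T, s, z), s, z)

# Round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(T, T_hat))
```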
Key innovation: QUAD enforces shared scale $s$ and shared zero-point $z$ across all LoRA weights $A_i$, $B_i$ and the base weight matrix $W$.
Quantization Sensitivity Score (QSS)
To determine which LoRA’s quantization parameters should serve as the anchor, a sensitivity analysis is performed:
\[\text{QSS} = \mathbb{E}\_x \left[ D\left(f(x; w) \parallel f(x; \tilde{w})\right) \right]\]where $f(x; w)$ is the full-precision LoRA output, $f(x; \tilde{w})$ is the quantized LoRA output, and $D(\cdot \parallel \cdot)$ is a divergence metric (e.g., Jensen-Shannon divergence). The LoRA with the highest QSS (most sensitive to quantization) has its quantization parameters adopted as the fixed shared parameters.
Fallback (Unified-LoRA): When all LoRAs are equally sensitive, global quantization parameters are computed from the merged weight distributions of all LoRA adapters.
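The Unified-LoRA fallback amounts to calibrating one shared scale/zero-point pair over the pooled weight distribution of every adapter; a sketch with toy weights (pure Python, values illustrative only):

```python
# Fallback: derive one shared (scale, zero-point) from the merged
# weight distribution of all LoRA adapters. Toy flattened weights.
Q_MIN, Q_MAX = -128, 127

def shared_encoding(adapters):
    pooled = [w for weights in adapters for w in weights]  # merge all adapters
    t_min, t_max = min(pooled), max(pooled)
    s = (t_max - t_min) / (Q_MAX - Q_MIN)
    z = Q_MIN - round(t_min / s)
    return s, z

lora_1 = [-0.8, 0.1, 0.4]   # flattened A/B weights, task 1
lora_2 = [-0.2, 0.3, 1.2]   # flattened A/B weights, task 2
s, z = shared_encoding([lora_1, lora_2])

# The shared encoding must cover every adapter's extremes (to within one step).
assert s * (Q_MAX - z) >= max(max(lora_1), max(lora_2)) - s
assert s * (Q_MIN - z) <= min(min(lora_1), min(lora_2)) + s
```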
Knowledge Distillation based Fine-tuning
After determining shared quantization parameters (from the anchor LoRA), the remaining LoRAs are fine-tuned to operate under this unified profile via distillation:
- A QuantSim model is constructed for the LVM with a non-anchor LoRA, where weights are quantized using PTQ encodings derived from the anchor LoRA.
- The full-precision network acts as the teacher, the quantized model as the student.
- LoRA parameters are optimized by minimizing a reconstruction loss between teacher and student outputs, combined with the original LVM training objective.
Through iterative optimization, the LoRA weights are adapted to satisfy the shared quantization parameters while preserving task performance.
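As a toy illustration of the idea (not the paper's training code): a scalar "LoRA" weight is fine-tuned with a straight-through estimator so that its quantized output matches a full-precision teacher, while the quantization scale stays fixed at the anchor-derived value. All numbers here are made up for the demonstration.

```python
# Toy distillation: adapt a scalar LoRA weight `a` so the quantized
# student output matches the FP32 teacher, with the scale `s` fixed
# (as if taken from the anchor LoRA). Straight-through estimator:
# gradients pass through round() as if it were the identity.
s = 0.25                      # shared scale, fixed by the anchor
W, a_teacher = 1.0, 0.37      # frozen base weight, original LoRA weight
x = 2.0

def student(a):
    return s * round((W + a) / s) * x   # symmetric quant with shared s

teacher = (W + a_teacher) * x
a, lr = 0.0, 0.05
for _ in range(200):
    grad = 2.0 * (student(a) - teacher) * x   # STE: d(round)/da ~ 1
    a -= lr * grad

# After fine-tuning, the student lands within one quantization step
# of the teacher, and is strictly better than the unadapted start.
assert abs(student(a) - teacher) <= s * abs(x)
assert abs(student(a) - teacher) < abs(student(0.0) - teacher)
```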
Graph Optimization Pipeline
After QUAD processing:
- Model conversion: PyTorch $\to$ ONNX $\to$ hardware-specific IR via vendor toolchains.
- Parallelism introduction: Linear layers mapped to convolution, multi-head attention decomposed into independent heads.
- Static operator fusion: Convolution + activation fused.
- Placeholder inputs added for LoRA weight tensors.
- Quantization nodes inserted using QUAD calibration parameters.
- Scale folding: Quantization scale/zero-point merged into adjacent layers.
- Constant folding: Precomputed constants embedded into graph.
- Dead code elimination: Unused branches removed.
Runtime Stack
Three main components:
- Graph runtime: Executes the quantized model IR.
- LoRA loader: Manages LoRA weight loading and buffer binding.
- Scheduler: Optimizes thread utilization and memory reuse across inference calls.
Supports extensions like LoRA caching, background preloading, and real-time task switching.
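The switching behaviour the stack enables can be mimicked in a few lines: one "compiled" graph, many adapter payloads, no rebuild between tasks. This is a conceptual sketch only, not the vendor runtime API; the task names and weights are invented.

```python
# One compiled graph, many tasks: the graph is built once and task
# switching just rebinds the (A, B) runtime inputs (r = 1 here).
def build_graph(W, alpha):
    # Stand-in for the one-time NPU compilation step.
    def graph(x, A, B):                       # A, B are runtime inputs
        Bx = sum(b * v for b, v in zip(B, x))
        return [row[0] * x[0] + row[1] * x[1] + alpha * a * Bx
                for row, a in zip(W, A)]
    return graph

graph = build_graph(W=[[1.0, 0.0], [0.0, 1.0]], alpha=1.0)

adapters = {                                   # per-task LoRA weights
    "object_removal":  ([1.0, 0.0], [1.0, 1.0]),
    "sketch_to_image": ([0.0, 1.0], [1.0, -1.0]),
}

x = [2.0, 3.0]
outputs = {task: graph(x, A, B) for task, (A, B) in adapters.items()}

# Different adapters steer the same compiled graph to different outputs.
assert outputs["object_removal"] != outputs["sketch_to_image"]
```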
Use Cases Evaluated
- Prompt-guided Image Transformation
- Object Removal
- Text-to-Image Generation
- Sketch-to-Image
- Sticker Generation
- Portrait Studio
Training
The paper does not provide exhaustive training hyperparameters (optimizer, learning rate, batch size, etc.) for the initial LoRA fine-tuning phase. However, the following details can be inferred:
- Base model: Stable Diffusion 1.5 (860M VAE + 1.1B or 0.7B U-Net)
- LoRA training: Standard LoRA fine-tuning with frozen base model; separate pairs $(A_i, B_i)$ per task
- QUAD fine-tuning (distillation phase):
- Teacher: Full-precision (FP32) LVM with original LoRA weights
- Student: Quantized LVM (W8A16 by default) with LoRA weights being fine-tuned
- Loss: Reconstruction loss between teacher/student outputs + original LVM training objective
- Quantization of anchor LoRA: Post-training quantization (PTQ) encodings serve as the shared profile
- Non-anchor LoRAs are quantization-aware fine-tuned under the anchor’s profile
- Hardware for training: x86 server (FP32 baseline)
- No specific training time, epochs, or optimizer details reported
Samplers used at inference:
- OLSS (On-device Latent Stepping Sampler): 8 steps
- LCM (Latent Consistency Model): 4 steps
- ED (Euler Discrete): 4 steps
Evaluation
Accuracy Analysis (FP32 vs INT8 on-device)
Table 1: Prompt-guided Image Transformation (base LoRA, directly quantized)
| Metric | Value |
|---|---|
| $s_{\text{imd}}$ (cosine similarity of direction vectors) | 0.9428 |
| $s_{\text{imimage}}$ (semantic similarity of latents) | 0.881 |
| Structure loss | 0.045 |
| Custom CLIP score | 0.008 |
Table 1: Object Removal (QUAD-aligned, quantized on-device)
| Metric | Value |
|---|---|
| FID | 5.5287 |
| LPIPS | 0.12 |
| SSIM | 0.94 |
| PSNR | 33.04 dB |
On-Device KPIs — 1.1B Model (Table 2)
Qualcomm GS25 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 449 ms |
| U-Net | 250 ms |
| VAE Decoder | 839 ms |
| End-to-End | 8826 ms (~8.8s) |
| Shared FM ROM | 119 MB |
| LoRA ROM | 1739 MB |
| Peak RAM | — |
Qualcomm GS25 — Object Removal (LCM, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 458 ms |
| U-Net | 249 ms |
| VAE Decoder | 836 ms |
| End-to-End | 3723 ms (~3.7s) |
| Shared FM ROM | 1375 MB |
| LoRA ROM | 119 MB |
| Peak RAM | 1873 MB |
LSI GS25 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 427 ms |
| U-Net | 409 ms |
| VAE Decoder | 896 ms |
| End-to-End | 12456 ms (~12.5s) |
| Shared FM ROM | 134 MB |
| LoRA ROM | 1259 MB |
| Peak RAM | — |
LSI GS25 — Object Removal (ED, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 425 ms |
| U-Net | 175 ms |
| VAE Decoder | 898 ms |
| End-to-End | 4217 ms (~4.2s) |
| Shared FM ROM | 1125 MB |
| LoRA ROM | 104 MB |
| Peak RAM | 1229 MB |
MediaTek Tab S11 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 639 ms |
| U-Net | 509 ms |
| VAE Decoder | 1090 ms |
| End-to-End | 15682 ms (~15.7s) |
| Shared FM ROM | 31 MB |
| LoRA ROM | 1590 MB |
| Peak RAM | — |
MediaTek Tab S11 — Object Removal (ED, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 643 ms |
| U-Net | 254 ms |
| VAE Decoder | 1082 ms |
| End-to-End | 5528 ms (~5.5s) |
| Shared FM ROM | 1177 MB |
| LoRA ROM | 87 MB |
| Peak RAM | 1494 MB |
On-Device KPIs — 0.7B Model, 4 Use-Cases (Table 3)
OLSS sampler, 8 steps:
| Use-Case | VAE Enc (ms) | U-Net (ms) | VAE Dec (ms) | E2E (ms) | FM ROM (MB) | LoRA ROM (MB) | Peak RAM (MB) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | — | 48 | 150 | 1052 | 844 | 77 | 2348 |
| Sketch-to-Image | 68 | 48 | 152 | 1527 | 844 | 77 | 2348 |
| Sticker Generation | — | 51 | 149 | 1874 | 844 | 77 | 2348 |
| Portrait Studio | — | 2024 | — | 2019 | 844 | 77 | 2176 |
Notable: The 0.7B model achieves end-to-end latency of roughly 1.1–2.0 s across the four tasks, with a shared FM ROM of only 844 MB and LoRA ROM of 77 MB per adapter.
Memory Benefit
A latent diffusion model of 1.4 GB combined with ten LoRA modules (~120 MB each) would require roughly 15 GB when compiled separately, since each of the ten binaries embeds its own full copy of the foundation model. QUAD requires only ~2.6 GB (one shared base model plus ten runtime-loaded adapters), a roughly 6x reduction in memory footprint.
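The memory arithmetic can be checked directly, assuming the stated figures (1.4 GB base model, ten ~120 MB adapters):

```python
# Memory footprint: per-task compiled binaries vs. QUAD's shared base.
base_gb, lora_gb, n_tasks = 1.4, 0.12, 10

separate = n_tasks * (base_gb + lora_gb)   # each binary embeds the base model
quad = base_gb + n_tasks * lora_gb         # one base + runtime-loaded adapters

assert round(separate, 1) == 15.2          # ~15 GB, as stated
assert round(quad, 1) == 2.6               # ~2.6 GB
assert separate / quad > 5.8               # ~6x reduction
```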
Latency improvement: Maintaining a single model graph and switching only LoRA weights saves 1.5 seconds in end-to-end latency during inference compared to the traditional per-graph approach.
Ablation: Mixed Precision Quantization (Table 4)
Prompt-guided Image Transformation on GS25 (proposed model uses W8A16):
| W8A8:W8A16 | FID | LPIPS | PSNR (dB) | SSIM |
|---|---|---|---|---|
| 0:100 | 12.23 | 0.108 | 32.71 | 0.981 |
| 20:80 | 13.05 | 0.109 | 32.68 | 0.981 |
| 40:60 | 12.80 | 0.109 | 32.60 | 0.980 |
| 60:40 | 14.28 | 0.113 | 31.41 | 0.978 |
| 80:20 | 26.62 | 0.137 | 28.88 | 0.963 |
| 100:0 | 599.07 | 0.699 | 5.44 | 0.232 |
Conclusion: W8A16 significantly outperforms W8A8. Full W8A8 activation quantization causes catastrophic quality degradation (FID 599). The best FID reported for the proposed W8A16 configuration is 5.53 (Object Removal, Table 1).
Ablation: Mixed Precision on Tab S11 (Table 5)
Object Removal use-case:
| W8A8:W8A16 | Init Time (ms) | Execute Time (ms) | LoRA ROM (MB) | FID | SSIM | PSNR (dB) |
|---|---|---|---|---|---|---|
| 0:100 | 469 | 1156 | 181 | 25.88 | 0.806 | 25.28 |
| 10:90 | 450 | 1144 | 158 | 27.42 | 0.795 | 25.32 |
| 20:80 | 438 | 1101 | 138 | 27.11 | 0.795 | 25.23 |
| 30:70 | 405 | 1097 | 138 | 27.11 | 0.794 | 25.18 |
LoRA weights in INT8 reduce ROM by 1.5x with marginal accuracy drop.
Reproduction Guide
Note: No public code repository is provided. The following is a general reproduction outline based on the paper’s description.
Step 1: Environment Setup
```bash
conda create -n quad python=3.10
conda activate quad
pip install torch torchvision diffusers transformers accelerate
pip install onnx onnxruntime
pip install pillow scipy
```
Step 2: Prepare Base Model and LoRAs
```python
from diffusers import StableDiffusionPipeline
import torch

# Load SD 1.5 base model
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float32
)

# Access U-Net for LoRA insertion
unet = pipeline.unet
```
Step 3: Restructure for LoRA-as-Input
Modify each LoRA-augmented linear layer to accept $A$ and $B$ as runtime inputs:
\[y = Wx + \alpha A(Bx)\]Replace static LoRA weight injection with dynamic input nodes. Export the model with placeholder inputs for LoRA tensors.
Step 4: Compute Quantization Sensitivity Score
For each LoRA adapter $i$, compute:
\[\text{QSS}\_i = \mathbb{E}\_x \left[ D\_\text{JS}\left(f(x; w\_i) \parallel f(x; \hat{w}\_i)\right) \right]\]using calibration data. Select the LoRA with highest QSS as the anchor.
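A pure-Python sketch of the Jensen-Shannon piece of this step; the "distributions" here are toy output histograms, since the paper does not fully specify the calibration setup:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions (nats).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric, bounded by ln(2).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy output histograms: full-precision vs. quantized, for two adapters.
fp  = [0.25, 0.25, 0.25, 0.25]
q_a = [0.24, 0.26, 0.25, 0.25]   # adapter A barely moves under quantization
q_b = [0.10, 0.40, 0.30, 0.20]   # adapter B shifts substantially

qss = {"A": js_divergence(fp, q_a), "B": js_divergence(fp, q_b)}
anchor = max(qss, key=qss.get)   # most quantization-sensitive adapter wins
assert anchor == "B"
assert 0.0 <= qss["A"] < qss["B"] <= math.log(2)
```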
Step 5: Apply QUAD Distillation
- Quantize the anchor LoRA using PTQ to get shared scale $s$ and zero-point $z$.
- For each non-anchor LoRA:
- Build QuantSim model using anchor’s quantization parameters
- Train with distillation loss: $\mathcal{L}_\text{total} = \mathcal{L}_\text{recon}(f_\text{teacher}, f_\text{student}) + \mathcal{L}_\text{LVM}$
- Fine-tune only LoRA parameters $A_i$, $B_i$
Step 6: Export and Compile
```bash
# Export to ONNX with LoRA placeholder inputs
python export_onnx.py --model quad_model --output model.onnx

# Convert to vendor-specific IR (e.g., Qualcomm SNPE, MediaTek NeuroPilot)
# Apply graph optimizations: scale folding, constant folding, dead code elimination
```
Step 7: Deploy and Verify
- Load compiled model on target device
- At runtime, bind LoRA weights to input slots via lightweight API
- Switch tasks by supplying different LoRA weights without recompilation
- Verify with metrics: FID, LPIPS, SSIM, PSNR against FP32 server baseline
Verification Checks
- Confirm single compiled binary serves all LoRA tasks
- Measure shared ROM usage vs. separate compilation approach
- Profile end-to-end latency across chipsets
- Compare visual quality with FP32 baseline
Notes
Key Takeaways
- Runtime LoRA injection is the critical architectural insight: By treating LoRA weights as runtime inputs to a frozen, pre-compiled graph, the entire deployment paradigm changes. This is a systems-level contribution that is simple in concept but requires careful integration with NPU toolchains.
- Quantization compatibility is the bottleneck: The paper correctly identifies that independently quantized LoRAs produce incompatible quantization profiles, which is the main barrier to single-graph multi-LoRA deployment on NPUs. The QUAD solution aligns distributions via distillation rather than retraining.
- W8A16 is the sweet spot for quality: Full W8A8 quantization of activations causes catastrophic degradation (FID jumps from ~5 to ~600). Weight-only INT8 with INT16 activations preserves quality while still providing significant memory savings.
- Chipset-agnostic approach: The framework works across Qualcomm, LSI/Exynos, and MediaTek NPUs, which is critical for real-world mobile deployment where devices use different SoCs.
- Practical Samsung deployment: This is an industry paper from Samsung Research with real on-device numbers across Galaxy S25 and Tab S11, not just server-side simulation.
Connections to Other Work
- QLoRA / QaLoRA: QUAD extends the quantization-aware LoRA paradigm from training efficiency to deployment efficiency, specifically targeting the multi-LoRA compatibility problem that QLoRA does not address.
- MobileDiffusion: While MobileDiffusion designs a smaller model for mobile, QUAD takes existing LVMs and makes them deployable — a more pragmatic approach for leveraging pre-trained models.
- AdapterFusion / Multi-LoRA merging: Works like Multi-LoRA meets Vision merge adapters at the weight level. QUAD takes the orthogonal approach of keeping adapters separate at runtime but unified in quantization parameters.
- TensorRT-LLM dynamic batching: The LoRA-as-input pattern parallels dynamic batching approaches in LLM serving, where different LoRA adapters are loaded into GPU memory slots. QUAD adapts this concept for mobile NPUs.
- Compressor/quantization surveys (Zheng 2025): The challenges of diffusion model edge deployment are well-documented; QUAD provides a concrete solution for the specific problem of modular adapter deployment.
Limitations
- No public code release; full reproduction from paper description alone would be non-trivial.
- Training hyperparameters (optimizer, lr, epochs, batch size) for the QUAD distillation phase are not reported.
- The approach is validated on SD 1.5; applicability to newer architectures (SDXL, Flux) is not demonstrated.
- The 0.7B model Portrait Studio row shows U-Net time of 2024 ms — likely an error or anomalous measurement not explained.
- Only 2–4 use-cases demonstrated; the claimed 6x memory benefit requires 10+ use-cases.