2026-04-02
Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge
Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati et al.
Problem
Deploying Large Vision Models (LVMs) for generative AI tasks (image editing, object removal, prompt-guided transformation) on resource-constrained edge devices (smartphones) is extremely challenging due to high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, the standard deployment approach is fundamentally flawed for multi-task scenarios:
Core problem: Existing mobile deployment pipelines compile separate model binaries for each LoRA plus a full copy of the foundation model. When deploying multiple GenAI use-cases, this leads to redundant storage, increased runtime overhead, and inability to switch tasks at runtime without recompilation or reloading entire model graphs.
Prior art and limitations:
- QLoRA (Dettmers et al., 2023): Improves memory efficiency during fine-tuning via quantization-aware techniques but assumes static model graphs; does not address multi-LoRA runtime switching on edge NPUs.
- QaLoRA (Xu et al., 2023): Quantization-aware low-rank adaptation for LLMs; not designed for vision diffusion models or NPU deployment.
- MobileDiffusion (Zhao et al., 2024): Reduces diffusion inference cost for mobile devices but builds a single-purpose model from scratch rather than adapting existing LVMs.
- ControlNet (Zhang et al., 2023): Introduces structured conditioning for controllable image editing but adds parallel network branches, increasing model size.
- UniVG (Fu et al., 2025), OneDiffusion (Le et al., 2025), Dual Diffusion (Li et al., 2025): Attempt to unify multiple tasks within a single generative framework via end-to-end multi-task training but do not support modular runtime adaptation or separate adapter switching.
- Multi-LoRA meets Vision (Kesim & Helli, 2024): Merges multiple adapters into a multi-task model but does not address quantization compatibility across LoRAs for edge deployment.
- Conv-Adapter (Chen et al., 2024), AdaptFormer (Chen et al., 2022): Parameter-efficient transfer learning methods but not designed for on-device deployment with dynamic switching.
Key gap: When each LoRA adapter is quantized independently, the resulting adapters require different quantization parameters (scale and zero-point), making them incompatible with a single static inference graph. This prevents efficient runtime task switching, increases memory overhead from multiple calibration states, and complicates NPU deployment where fixed quantization parameters are typically required.
Architecture
Overview
The QUAD (Quantization with Unified Adaptive Distillation) framework has three main components:
- LoRA-as-Input reformulation: Restructure the LVM so LoRA weights are runtime inputs rather than baked into the compiled graph.
- Unified Adaptive Distillation: Align all LoRA weight distributions to share a single quantization profile via knowledge distillation.
- Edge deployment stack: Graph optimization, conversion to hardware-specific IR, and lightweight runtime with dynamic LoRA loading.
Base Model: Latent Diffusion Backbone
The foundation model follows a Stable Diffusion 1.5 architecture. The forward pass is:
\[\hat{x} = D\left(U(z\_t, c)\right), \quad z\_t = E(x) + \epsilon\_t\]where $E$ is the VAE encoder, $U$ is the denoising U-Net backbone, $D$ is the VAE decoder, $x$ is the input image, $\epsilon_t$ is the noise at timestep $t$, $z_t$ is the noisy latent encoding, and $c$ is the conditioning (text prompt or image).
Two model sizes are used:
- 1.1B parameter U-Net (used for 2-use-case evaluation in Table 2)
- 0.7B parameter U-Net (used for 4-use-case evaluation in Table 3)
LoRA Augmentation
For each linear transformation $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ in the U-Net’s transformer and convolution blocks, a LoRA-augmented version is:
\[W\_{\text{LoRA}} = W + \alpha A B\]where $A \in \mathbb{R}^{d_{\text{out}} \times r}$, $B \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll \min(d_{\text{out}}, d_{\text{in}})$ is the rank, and $\alpha$ is the scaling factor. During multi-LoRA training, the base model parameters are frozen and separate pairs $(A_i, B_i)$ are trained for each task $i$.
LoRA-as-Input Reformulation
Instead of merging LoRA weights into the model graph, each LoRA-augmented layer is modified to expose additional input nodes for $A$ and $B$. The computation becomes:
\[y = Wx + \alpha A(Bx)\]where $W$ is the frozen weight binary, $x$ and $y$ are input and output feature maps, and $A$, $B$ are supplied as runtime inputs. The model is compiled once; at inference time, different tasks are supported by supplying corresponding LoRA weights on-the-fly.
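To make the reformulation concrete, here is a minimal pure-Python sketch (no framework; all shapes and values are illustrative) checking that supplying $A$ and $B$ as runtime inputs reproduces the merged-weight computation exactly:

```python
# Toy check: y = W x + alpha * A (B x) equals (W + alpha * A B) x.
# Matrices are plain lists of lists; shapes are illustrative only.

def matvec(M, v):
    return [sum(m * u for m, u in zip(row, v)) for row in M]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

W = [[1.0, 2.0], [3.0, 4.0]]          # frozen base weight (d_out x d_in)
A = [[0.5], [-1.0]]                   # LoRA up-projection (d_out x r), r = 1
B = [[2.0, 0.0]]                      # LoRA down-projection (r x d_in)
alpha = 0.1
x = [1.0, -1.0]

# Compiled-graph path: A, B arrive as runtime inputs.
lora = matvec(A, matvec(B, x))
y_runtime = [wx + alpha * l for wx, l in zip(matvec(W, x), lora)]

# Reference path: LoRA merged into the weight offline.
AB = matmul(A, B)
W_merged = [[W[i][j] + alpha * AB[i][j] for j in range(2)] for i in range(2)]
y_merged = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(y_runtime, y_merged))
```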
Unified Quantization Strategy
The quantization setup follows standard affine quantization. For a tensor $T$:
\[s = \frac{T\_{\max} - T\_{\min}}{q\_{\max} - q\_{\min}}\] \[z = q\_{\min} - \left\lfloor \frac{T\_{\min}}{s} \right\rceil\] \[\hat{T} = \text{clip}\left(\left\lfloor \frac{T}{s} \right\rceil + z,\; q\_{\min},\; q\_{\max}\right)\] \[T \approx s \cdot (\hat{T} - z)\]where $\lfloor \cdot \rceil$ denotes round-to-nearest. For signed INT8: $q_{\max} = 127$, $q_{\min} = -128$.
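A minimal sketch of this affine scheme in pure Python (tensor values are illustrative, not from the paper), checking that the round-trip error stays within half a quantization step:

```python
# Affine INT8 quantization of a tensor (here a flat list):
# calibrate scale/zero-point from the tensor range, then quantize/dequantize.
Q_MIN, Q_MAX = -128, 127

def calibrate(t):
    t_min, t_max = min(t), max(t)
    s = (t_max - t_min) / (Q_MAX - Q_MIN)   # scale from tensor range
    z = Q_MIN - round(t_min / s)            # zero-point maps t_min -> Q_MIN
    return s, z

def quantize(t, s, z):
    return [max(Q_MIN, min(Q_MAX, round(v / s) + z)) for v in t]

def dequantize(q, s, z):
    return [s * (v - z) for v in q]

T = [-1.0, -0.25, 0.0, 0.5, 1.5]
s, z = calibrate(T)
T_hat = dequantize(quantize(T, s, z), s, z)

# Round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(T, T_hat))
```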
Key innovation: QUAD enforces shared scale $s$ and shared zero-point $z$ across all LoRA weights $A_i$, $B_i$ and the base weight matrix $W$.
Quantization Sensitivity Score (QSS)
To determine which LoRA’s quantization parameters should serve as the anchor, a sensitivity analysis is performed:
\[\text{QSS} = \mathbb{E}\_x \left[ D\left(f(x; w) \parallel f(x; \tilde{w})\right) \right]\]where $f(x; w)$ is the full-precision LoRA output, $f(x; \tilde{w})$ is the quantized LoRA output, and $D(\cdot \parallel \cdot)$ is a divergence metric (e.g., Jensen-Shannon divergence). The LoRA with the highest QSS (most sensitive to quantization) has its quantization parameters adopted as the fixed shared parameters.
Fallback (Unified-LoRA): When all LoRAs are equally sensitive, global quantization parameters are computed from the merged weight distributions of all LoRA adapters.
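The Unified-LoRA fallback amounts to calibrating one shared scale/zero-point pair over the pooled weight distribution of every adapter; a sketch with toy weights (pure Python, values illustrative only):

```python
# Fallback: derive one shared (scale, zero-point) from the merged
# weight distribution of all LoRA adapters. Toy flattened weights.
Q_MIN, Q_MAX = -128, 127

def shared_encoding(adapters):
    pooled = [w for weights in adapters for w in weights]  # merge all adapters
    t_min, t_max = min(pooled), max(pooled)
    s = (t_max - t_min) / (Q_MAX - Q_MIN)
    z = Q_MIN - round(t_min / s)
    return s, z

lora_1 = [-0.8, 0.1, 0.4]   # flattened A/B weights, task 1
lora_2 = [-0.2, 0.3, 1.2]   # flattened A/B weights, task 2
s, z = shared_encoding([lora_1, lora_2])

# The shared encoding must cover every adapter's extremes (to within one step).
assert s * (Q_MAX - z) >= max(max(lora_1), max(lora_2)) - s
assert s * (Q_MIN - z) <= min(min(lora_1), min(lora_2)) + s
```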
Knowledge Distillation based Fine-tuning
After determining shared quantization parameters (from the anchor LoRA), the remaining LoRAs are fine-tuned to operate under this unified profile via distillation:
- A QuantSim model is constructed for the LVM with a non-anchor LoRA, where weights are quantized using PTQ encodings derived from the anchor LoRA.
- The full-precision network acts as the teacher, the quantized model as the student.
- LoRA parameters are optimized by minimizing a reconstruction loss between teacher and student outputs, combined with the original LVM training objective.
Through iterative optimization, the LoRA weights are adapted to satisfy the shared quantization parameters while preserving task performance.
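As a toy illustration of the idea (not the paper's training code): a scalar "LoRA" weight is fine-tuned with a straight-through estimator so that its quantized output matches a full-precision teacher, while the quantization scale stays fixed at the anchor-derived value. All numbers here are made up for the demonstration.

```python
# Toy distillation: adapt a scalar LoRA weight `a` so the quantized
# student output matches the FP32 teacher, with the scale `s` fixed
# (as if taken from the anchor LoRA). Straight-through estimator:
# gradients pass through round() as if it were the identity.
s = 0.25                      # shared scale, fixed by the anchor
W, a_teacher = 1.0, 0.37      # frozen base weight, original LoRA weight
x = 2.0

def student(a):
    return s * round((W + a) / s) * x   # symmetric quant with shared s

teacher = (W + a_teacher) * x
a, lr = 0.0, 0.05
for _ in range(200):
    grad = 2.0 * (student(a) - teacher) * x   # STE: d(round)/da ~ 1
    a -= lr * grad

# After fine-tuning, the student lands within one quantization step
# of the teacher, and is strictly better than the unadapted start.
assert abs(student(a) - teacher) <= s * abs(x)
assert abs(student(a) - teacher) < abs(student(0.0) - teacher)
```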
Graph Optimization Pipeline
After QUAD processing:
- Model conversion: PyTorch $\to$ ONNX $\to$ hardware-specific IR via vendor toolchains.
- Parallelism introduction: Linear layers mapped to convolution, multi-head attention decomposed into independent heads.
- Static operator fusion: Convolution + activation fused.
- Placeholder inputs added for LoRA weight tensors.
- Quantization nodes inserted using QUAD calibration parameters.
- Scale folding: Quantization scale/zero-point merged into adjacent layers.
- Constant folding: Precomputed constants embedded into graph.
- Dead code elimination: Unused branches removed.
Runtime Stack
Three main components:
- Graph runtime: Executes the quantized model IR.
- LoRA loader: Manages LoRA weight loading and buffer binding.
- Scheduler: Optimizes thread utilization and memory reuse across inference calls.
Supports extensions like LoRA caching, background preloading, and real-time task switching.
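The switching behaviour the stack enables can be mimicked in a few lines: one "compiled" graph, many adapter payloads, no rebuild between tasks. This is a conceptual sketch only, not the vendor runtime API; the task names and weights are invented.

```python
# One compiled graph, many tasks: the graph is built once and task
# switching just rebinds the (A, B) runtime inputs (r = 1 here).
def build_graph(W, alpha):
    # Stand-in for the one-time NPU compilation step.
    def graph(x, A, B):                       # A, B are runtime inputs
        Bx = sum(b * v for b, v in zip(B, x))
        return [row[0] * x[0] + row[1] * x[1] + alpha * a * Bx
                for row, a in zip(W, A)]
    return graph

graph = build_graph(W=[[1.0, 0.0], [0.0, 1.0]], alpha=1.0)

adapters = {                                   # per-task LoRA weights
    "object_removal":  ([1.0, 0.0], [1.0, 1.0]),
    "sketch_to_image": ([0.0, 1.0], [1.0, -1.0]),
}

x = [2.0, 3.0]
outputs = {task: graph(x, A, B) for task, (A, B) in adapters.items()}

# Different adapters steer the same compiled graph to different outputs.
assert outputs["object_removal"] != outputs["sketch_to_image"]
```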
Use Cases Evaluated
- Prompt-guided Image Transformation
- Object Removal
- Text-to-Image Generation
- Sketch-to-Image
- Sticker Generation
- Portrait Studio
Training
The paper does not provide exhaustive training hyperparameters (optimizer, learning rate, batch size, etc.) for the initial LoRA fine-tuning phase. However, the following details can be inferred:
- Base model: Stable Diffusion 1.5 (860M VAE + 1.1B or 0.7B U-Net)
- LoRA training: Standard LoRA fine-tuning with frozen base model; separate pairs $(A_i, B_i)$ per task
- QUAD fine-tuning (distillation phase):
- Teacher: Full-precision (FP32) LVM with original LoRA weights
- Student: Quantized LVM (W8A16 by default) with LoRA weights being fine-tuned
- Loss: Reconstruction loss between teacher/student outputs + original LVM training objective
- Quantization of anchor LoRA: Post-training quantization (PTQ) encodings serve as the shared profile
- Non-anchor LoRAs are quantization-aware fine-tuned under the anchor’s profile
- Hardware for training: x86 server (FP32 baseline)
- No specific training time, epochs, or optimizer details reported
Samplers used at inference:
- OLSS (On-device Latent Stepping Sampler): 8 steps
- LCM (Latent Consistency Model): 4 steps
- ED (Euler Discrete): 4 steps
Evaluation
Accuracy Analysis (FP32 vs INT8 on-device)
Table 1: Prompt-guided Image Transformation (base LoRA, directly quantized)
| Metric | Value |
|---|---|
| $s_{\text{imd}}$ (cosine similarity of direction vectors) | 0.9428 |
| $s_{\text{imimage}}$ (semantic similarity of latents) | 0.881 |
| Structure loss | 0.045 |
| Custom CLIP score | 0.008 |
Table 1: Object Removal (QUAD-aligned, quantized on-device)
| Metric | Value |
|---|---|
| FID | 5.5287 |
| LPIPS | 0.12 |
| SSIM | 0.94 |
| PSNR | 33.04 dB |
On-Device KPIs — 1.1B Model (Table 2)
Qualcomm GS25 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 449 ms |
| U-Net | 250 ms |
| VAE Decoder | 839 ms |
| End-to-End | 8826 ms (~8.8s) |
| Shared FM ROM | 119 MB |
| LoRA ROM | 1739 MB |
| Peak RAM | — |
Qualcomm GS25 — Object Removal (LCM, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 458 ms |
| U-Net | 249 ms |
| VAE Decoder | 836 ms |
| End-to-End | 3723 ms (~3.7s) |
| Shared FM ROM | 1375 MB |
| LoRA ROM | 119 MB |
| Peak RAM | 1873 MB |
LSI GS25 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 427 ms |
| U-Net | 409 ms |
| VAE Decoder | 896 ms |
| End-to-End | 12456 ms (~12.5s) |
| Shared FM ROM | 134 MB |
| LoRA ROM | 1259 MB |
| Peak RAM | — |
LSI GS25 — Object Removal (ED, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 425 ms |
| U-Net | 175 ms |
| VAE Decoder | 898 ms |
| End-to-End | 4217 ms (~4.2s) |
| Shared FM ROM | 1125 MB |
| LoRA ROM | 104 MB |
| Peak RAM | 1229 MB |
MediaTek Tab S11 — Prompt-guided Image Transformation (OLSS, 8 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 639 ms |
| U-Net | 509 ms |
| VAE Decoder | 1090 ms |
| End-to-End | 15682 ms (~15.7s) |
| Shared FM ROM | 31 MB |
| LoRA ROM | 1590 MB |
| Peak RAM | — |
MediaTek Tab S11 — Object Removal (ED, 4 steps)
| Metric | Value |
|---|---|
| VAE Encoder | 643 ms |
| U-Net | 254 ms |
| VAE Decoder | 1082 ms |
| End-to-End | 5528 ms (~5.5s) |
| Shared FM ROM | 1177 MB |
| LoRA ROM | 87 MB |
| Peak RAM | 1494 MB |
On-Device KPIs — 0.7B Model, 4 Use-Cases (Table 3)
OLSS sampler, 8 steps:
| Use-Case | VAE Enc (ms) | U-Net (ms) | VAE Dec (ms) | E2E (ms) | FM ROM (MB) | LoRA ROM (MB) | Peak RAM (MB) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | — | 48 | 150 | 1052 | 844 | 77 | 2348 |
| Sketch-to-Image | 68 | 48 | 152 | 1527 | 844 | 77 | 2348 |
| Sticker Generation | — | 51 | 149 | 1874 | 844 | 77 | 2348 |
| Portrait Studio | — | 2024 | — | 2019 | 844 | 77 | 2176 |
Notable: The 0.7B model achieves end-to-end latency of roughly 1.1–2.0 s across the four tasks, with a shared FM ROM of only 844 MB and LoRA ROM of 77 MB per adapter.
Memory Benefit
A latent diffusion model of 1.4 GB combined with ten LoRA modules (~120 MB each) would require roughly 15 GB when compiled separately, since each of the ten binaries embeds its own full copy of the foundation model. QUAD requires only ~2.6 GB (one shared base model plus ten runtime-loaded adapters), a roughly 6x reduction in memory footprint.
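The memory arithmetic can be checked directly, assuming the stated figures (1.4 GB base model, ten ~120 MB adapters):

```python
# Memory footprint: per-task compiled binaries vs. QUAD's shared base.
base_gb, lora_gb, n_tasks = 1.4, 0.12, 10

separate = n_tasks * (base_gb + lora_gb)   # each binary embeds the base model
quad = base_gb + n_tasks * lora_gb         # one base + runtime-loaded adapters

assert round(separate, 1) == 15.2          # ~15 GB, as stated
assert round(quad, 1) == 2.6               # ~2.6 GB
assert separate / quad > 5.8               # ~6x reduction
```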
Latency improvement: Maintaining a single model graph and switching only LoRA weights saves 1.5 seconds in end-to-end latency during inference compared to the traditional per-graph approach.
Ablation: Mixed Precision Quantization (Table 4)
Prompt-guided Image Transformation on GS25 (proposed model uses W8A16):
| W8A8:W8A16 | FID | LPIPS | PSNR (dB) | SSIM |
|---|---|---|---|---|
| 0:100 | 12.23 | 0.108 | 32.71 | 0.981 |
| 20:80 | 13.05 | 0.109 | 32.68 | 0.981 |
| 40:60 | 12.80 | 0.109 | 32.60 | 0.980 |
| 60:40 | 14.28 | 0.113 | 31.41 | 0.978 |
| 80:20 | 26.62 | 0.137 | 28.88 | 0.963 |
| 100:0 | 599.07 | 0.699 | 5.44 | 0.232 |
Conclusion: W8A16 significantly outperforms W8A8. Full W8A8 activation quantization causes catastrophic quality degradation (FID 599). The best FID reported for the proposed W8A16 configuration is 5.53 (Object Removal, Table 1).
Ablation: Mixed Precision on Tab S11 (Table 5)
Object Removal use-case:
| W8A8:W8A16 | Init Time (ms) | Execute Time (ms) | LoRA ROM (MB) | FID | SSIM | PSNR (dB) |
|---|---|---|---|---|---|---|
| 0:100 | 469 | 1156 | 181 | 25.88 | 0.806 | 25.28 |
| 10:90 | 450 | 1144 | 158 | 27.42 | 0.795 | 25.32 |
| 20:80 | 438 | 1101 | 138 | 27.11 | 0.795 | 25.23 |
| 30:70 | 405 | 1097 | 138 | 27.11 | 0.794 | 25.18 |
LoRA weights in INT8 reduce ROM by 1.5x with marginal accuracy drop.
Reproduction Guide
Note: No public code repository is provided. The following is a general reproduction outline based on the paper’s description.
Step 1: Environment Setup
```bash
conda create -n quad python=3.10
conda activate quad
pip install torch torchvision diffusers transformers accelerate
pip install onnx onnxruntime
pip install pillow scipy
```
Step 2: Prepare Base Model and LoRAs
```python
from diffusers import StableDiffusionPipeline
import torch

# Load SD 1.5 base model
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float32
)

# Access U-Net for LoRA insertion
unet = pipeline.unet
```
Step 3: Restructure for LoRA-as-Input
Modify each LoRA-augmented linear layer to accept $A$ and $B$ as runtime inputs:
\[y = Wx + \alpha A(Bx)\]Replace static LoRA weight injection with dynamic input nodes. Export the model with placeholder inputs for LoRA tensors.
Step 4: Compute Quantization Sensitivity Score
For each LoRA adapter $i$, compute:
\[\text{QSS}\_i = \mathbb{E}\_x \left[ D\_\text{JS}\left(f(x; w\_i) \parallel f(x; \hat{w}\_i)\right) \right]\]using calibration data. Select the LoRA with highest QSS as the anchor.
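A pure-Python sketch of the Jensen-Shannon piece of this step; the "distributions" here are toy output histograms, since the paper does not fully specify the calibration setup:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions (nats).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric, bounded by ln(2).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy output histograms: full-precision vs. quantized, for two adapters.
fp  = [0.25, 0.25, 0.25, 0.25]
q_a = [0.24, 0.26, 0.25, 0.25]   # adapter A barely moves under quantization
q_b = [0.10, 0.40, 0.30, 0.20]   # adapter B shifts substantially

qss = {"A": js_divergence(fp, q_a), "B": js_divergence(fp, q_b)}
anchor = max(qss, key=qss.get)   # most quantization-sensitive adapter wins
assert anchor == "B"
assert 0.0 <= qss["A"] < qss["B"] <= math.log(2)
```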
Step 5: Apply QUAD Distillation
- Quantize the anchor LoRA using PTQ to get shared scale $s$ and zero-point $z$.
- For each non-anchor LoRA:
- Build QuantSim model using anchor’s quantization parameters
- Train with distillation loss: $\mathcal{L}_\text{total} = \mathcal{L}_\text{recon}(f_\text{teacher}, f_\text{student}) + \mathcal{L}_\text{LVM}$
- Fine-tune only LoRA parameters $A_i$, $B_i$
Step 6: Export and Compile
```bash
# Export to ONNX with LoRA placeholder inputs
python export_onnx.py --model quad_model --output model.onnx

# Convert to vendor-specific IR (e.g., Qualcomm SNPE, MediaTek NeuroPilot)
# Apply graph optimizations: scale folding, constant folding, dead code elimination
```
Step 7: Deploy and Verify
- Load compiled model on target device
- At runtime, bind LoRA weights to input slots via lightweight API
- Switch tasks by supplying different LoRA weights without recompilation
- Verify with metrics: FID, LPIPS, SSIM, PSNR against FP32 server baseline
Verification Checks
- Confirm single compiled binary serves all LoRA tasks
- Measure shared ROM usage vs. separate compilation approach
- Profile end-to-end latency across chipsets
- Compare visual quality with FP32 baseline
Notes
Key Takeaways
- Runtime LoRA injection is the critical architectural insight: By treating LoRA weights as runtime inputs to a frozen, pre-compiled graph, the entire deployment paradigm changes. This is a systems-level contribution that is simple in concept but requires careful integration with NPU toolchains.
- Quantization compatibility is the bottleneck: The paper correctly identifies that independently quantized LoRAs produce incompatible quantization profiles, which is the main barrier to single-graph multi-LoRA deployment on NPUs. The QUAD solution aligns distributions via distillation rather than retraining.
- W8A16 is the sweet spot for quality: Full W8A8 quantization of activations causes catastrophic degradation (FID jumps from ~5 to ~600). Weight-only INT8 with INT16 activations preserves quality while still providing significant memory savings.
- Chipset-agnostic approach: The framework works across Qualcomm, LSI/Exynos, and MediaTek NPUs, which is critical for real-world mobile deployment where devices use different SoCs.
- Practical Samsung deployment: This is an industry paper from Samsung Research with real on-device numbers across Galaxy S25 and Tab S11, not just server-side simulation.
Connections to Other Work
- QLoRA / QaLoRA: QUAD extends the quantization-aware LoRA paradigm from training efficiency to deployment efficiency, specifically targeting the multi-LoRA compatibility problem that QLoRA does not address.
- MobileDiffusion: While MobileDiffusion designs a smaller model for mobile, QUAD takes existing LVMs and makes them deployable — a more pragmatic approach for leveraging pre-trained models.
- AdapterFusion / Multi-LoRA merging: Works like Multi-LoRA meets Vision merge adapters at the weight level. QUAD takes the orthogonal approach of keeping adapters separate at runtime but unified in quantization parameters.
- TensorRT-LLM dynamic batching: The LoRA-as-input pattern parallels dynamic batching approaches in LLM serving, where different LoRA adapters are loaded into GPU memory slots. QUAD adapts this concept for mobile NPUs.
- Compressor/quantization surveys (Zheng 2025): The challenges of diffusion model edge deployment are well-documented; QUAD provides a concrete solution for the specific problem of modular adapter deployment.
Limitations
- No public code release; full reproduction from paper description alone would be non-trivial.
- Training hyperparameters (optimizer, lr, epochs, batch size) for the QUAD distillation phase are not reported.
- The approach is validated on SD 1.5; applicability to newer architectures (SDXL, Flux) is not demonstrated.
- The 0.7B model Portrait Studio row shows U-Net time of 2024 ms — likely an error or anomalous measurement not explained.
- Only 2–4 use-cases demonstrated; the claimed 6x memory benefit requires 10+ use-cases.