2026-04-05

Steerable Visual Representations

Jona Ruthardt, Manu Gaur, Deva Ramanan

ViT steering visual-representations

Problem

Pretrained ViTs like DINOv2, MAE, and SigLIP produce query-agnostic features that collapse to the most salient object (e.g., a “cat” in a scene), with no mechanism to steer toward less prominent concepts like “remote control” or “bookshelf.” Existing multimodal approaches have fundamental limitations:

  • Unimodal ViTs (DINOv2, MAE): rich visual features but zero steerability; DINOv2 achieves only 43.7% on text-guided retrieval (CORE), versus SteerViT's 96.0%.
  • Cross-modal encoders (CLIP, SigLIP): text provides training supervision only — the visual encoder cannot be steered at inference. Post-hoc late fusion (element-wise addition of text to frozen visual features) yields a negligible 0.02% boost.
  • FLAIR (Xiao et al., CVPR 2025): applies text-conditioned attention pooling over frozen SigLIP (late fusion), achieving 81.3% steerability but underperforming unimodal encoders on visual benchmarks.
  • MLLMs (InternVL3, Qwen3-VL, LFM-2.5-VL): offer moderate steerability but produce language-centric representations with diminished visual fidelity, require $\geq$1B parameters, and fuse text inside the LLM rather than the visual encoder.
  • OV localization (SAM3, GroundingDINO): highly steerable but features are specialized for localization and lack generality for classification/retrieval/segmentation.

No existing approach satisfies all three desiderata: (1) text steerability, (2) visual representation quality, (3) early vision-language fusion inside the visual encoder.

Architecture

SteerViT interleaves lightweight gated cross-attention layers inside frozen ViT blocks. Only ~21.2M trainable parameters are added (no FFN); both the visual and text encoders remain frozen. The architecture has four components:

A. Visual encoder. A frozen ViT (default: DINOv2 ViT-B/14) producing $N$ patch tokens $Z_v \in \mathbb{R}^{N \times d_v}$. Also tested on SigLIP ViT-B/16 and MAE ViT-B/16.

B. Text encoder. Frozen RoBERTa-Large (355M params) producing token embeddings $Z_t \in \mathbb{R}^{L \times d_t}$ for a conditioning prompt $X_t$.

C. Multimodal adapter. The text token embeddings $Z_t$ are $\ell_2$-normalized, then projected through a trainable 2-layer MLP into the visual-aligned space, $H_t \in \mathbb{R}^{L \times d_v}$. Replacing this MLP with a single linear layer drops FG-CLS by 1.0 and PODS by 1.7 points (Tab. 4).
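A minimal NumPy sketch of this adapter; the GELU nonlinearity and bias terms are assumptions (the paper specifies only "2-layer MLP"):

```python
import numpy as np

def adapt_text(Z_t, W1, b1, W2, b2):
    """l2-normalize text tokens, then project d_t -> d_v with a 2-layer MLP.

    Z_t: (L, d_t) frozen text-encoder token embeddings.
    Returns H_t: (L, d_v) text tokens in the visual-aligned space.
    """
    Z = Z_t / (np.linalg.norm(Z_t, axis=-1, keepdims=True) + 1e-9)
    h = Z @ W1 + b1
    # GELU (tanh approximation); the actual activation is an assumption.
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2 + b2

# Illustrative shapes: d_t=16 -> hidden 32 -> d_v=8 (real dims differ).
rng = np.random.default_rng(1)
H_t = adapt_text(rng.normal(size=(3, 16)),
                 rng.normal(size=(16, 32)), np.zeros(32),
                 rng.normal(size=(32, 8)), np.zeros(8))
```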

D. Gated cross-attention layers. Inserted into every other ViT block (6 CA layers for 12-block ViT-B). Vision patch tokens $Z_v^{(\ell)}$ are queries; adapted text tokens $H_t$ are keys and values — the inverse of Flamingo’s language→vision formulation:

\[\hat{Z}_v^{(\ell)} = \mathrm{CA}(Z_v^{(\ell)}, H_t) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \quad Q = Z_v^{(\ell)} W_Q, \; K = H_t W_K, \; V = H_t W_V\]

Output is integrated via a tanh gate with a per-layer learnable scalar $\alpha_\ell$, initialized to zero:

\[Z_v^{(\ell+1)} = Z_v^{(\ell)} + \tanh(\alpha_\ell) \cdot \hat{Z}_v^{(\ell)}\]

At initialization $\tanh(0) = 0$, so the model is identical to the frozen ViT. The gradient $\partial Z_v^{(\ell+1)} / \partial \alpha_\ell = \mathrm{sech}^2(\alpha_\ell) \cdot \hat{Z}_v^{(\ell)}$ is non-zero since $\mathrm{sech}^2(0) = 1$, allowing $\alpha_\ell$ to move away from zero during optimization. At inference, scaling all $\alpha_\ell$ by a factor $\omega \in [0,1]$ provides a continuous control knob; optimal operating point is $\omega = 0.6$ for DINOv2 and SigLIP.

The gated FFN from Flamingo is intentionally omitted — it adds 67% more parameters (35.4M vs. 21.2M) and consistently hurts steerability and OOD transfer (e.g., MAE CORE drops 7.2 points with FFN; Tab. 11).
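The gated cross-attention update above, including the inference-time gate scale $\omega$, can be sketched in NumPy (single-head, illustrative shapes and names; not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(Z_v, H_t, W_Q, W_K, W_V, alpha, omega=1.0):
    """One gated CA layer: vision patch tokens query adapted text tokens.

    Z_v: (N, d_v) patch tokens entering this block.
    H_t: (L, d_v) adapted text tokens (keys and values).
    omega rescales the gate at inference (the steerability knob).
    """
    Q, K, V = Z_v @ W_Q, H_t @ W_K, H_t @ W_V
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (N, L)
    Z_hat = attn @ V                                         # (N, d_v)
    return Z_v + np.tanh(omega * alpha) * Z_hat              # gated residual

# Zero-init gate: output is exactly the frozen ViT stream.
rng = np.random.default_rng(0)
N, L, d = 4, 3, 8
Z_v, H_t = rng.normal(size=(N, d)), rng.normal(size=(L, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
assert np.allclose(gated_cross_attention(Z_v, H_t, W_Q, W_K, W_V, alpha=0.0), Z_v)
```

With `alpha = 0.0` the layer is an identity on the residual stream, matching the zero-init property; setting `omega = 0.0` at inference recovers the frozen backbone regardless of the learned gates.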

Training

Objective. Patch-level referential segmentation. Given an image $X_v$ and referring expression $X_t$, a linear classifier maps each patch token to a foreground probability via softmax. Ground-truth $y_i$ is the fraction of foreground pixels within each patch on a binary segmentation mask. Training uses soft cross-entropy:

\[\mathcal{L} = -\sum_{i=1}^{n \times n} y_i \log p_i\]

Segmentation masks are generated by SAM2 conditioned on ground-truth bounding boxes. Segmentation supervision outperforms a Gaussian-pointing alternative by +7.3 FG-CLS, +8.0 ADE20k, +12.4 PODS (Tab. 9), because it teaches both where and what the object looks like.
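A minimal sketch of the soft cross-entropy as printed above, with $p_i$ taken as given per-patch foreground probabilities (how the linear head produces them is left out):

```python
import numpy as np

def soft_cross_entropy(p, y, eps=1e-9):
    """L = -sum_i y_i log p_i over the n*n patch grid.

    p: per-patch foreground probabilities, shape (n*n,).
    y: soft targets, the foreground-pixel fraction per patch, shape (n*n,).
    """
    return float(-(y * np.log(p + eps)).sum())
```

For example, a patch with soft target $y_i = 1$ and predicted probability $p_i = 0.5$ contributes $\log 2 \approx 0.693$ to the loss.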

Data. A mixture of six datasets: RefCOCO (137k pairs), RefCOCO+ (136k), RefCOCOg (91k), Visual Genome (50k images, 1,445k pairs), LVIS (368k images), and Mapillary Vistas (17k images, 97k pairs) — totaling 162k unique images and 2.28M image-text pairs.

Training details.

  • Resolution: $336 \times 336$
  • Batch size: 12
  • Optimizer: AdamW with cosine schedule
  • LR: warm up to $3 \times 10^{-4}$ over 5k steps, decay to $3 \times 10^{-5}$ by 40k steps, constant thereafter
  • Duration: 500k iterations (~84 H100 GPU-hours)
  • Steerability emerges rapidly: CORE reaches 95.3% within 50k iterations. Deeper language understanding (PODS: 49.9→58.1, RefCOCOg: 63.4→70.6) continues improving through 450k iterations.
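The learning-rate schedule in the list above can be written out as follows; the linear warmup shape is an assumption (only the endpoints are stated in the notes):

```python
import math

def lr_at(step, lr_peak=3e-4, lr_floor=3e-5, warmup=5_000, decay_end=40_000):
    """Linear warmup to lr_peak, cosine decay to lr_floor by decay_end,
    then constant for the remainder of the 500k iterations."""
    if step < warmup:
        return lr_peak * step / warmup                 # warmup phase
    if step < decay_end:
        t = (step - warmup) / (decay_end - warmup)     # progress in [0, 1)
        return lr_floor + 0.5 * (lr_peak - lr_floor) * (1 + math.cos(math.pi * t))
    return lr_floor                                    # constant tail
```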

Evaluation

CORE Benchmark (text-guided image retrieval)

New benchmark: 100 images for each of 6 SUN397 scene categories, with 5 objects inpainted into each base image via FLUX.2 (3,000 images total). One-vs-all retrieval measures whether a model can steer global features away from scene-level similarity toward a specified non-salient object.

Method          CORE acc@1
Random               20.0%
MAE                  21.8%
DINOv2               43.7%
CLIP                 44.2%
FLAIR                81.3%
InternVL3-1B         47.0%
InternVL3-2B         76.0%
Qwen3-VL-2B          69.7%
SAM3                 93.3%
SteerViT             96.0%
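The one-vs-all protocol reduces to top-1 cosine-similarity retrieval; a generic sketch (the exact CORE gallery construction is an assumption):

```python
import numpy as np

def top1_accuracy(queries, gallery, gt_idx):
    """Top-1 retrieval accuracy under cosine similarity.

    queries: (Q, d) text-conditioned global image/query embeddings.
    gallery: (G, d) candidate embeddings; gt_idx[i] is the correct gallery row.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    pred = (q @ g.T).argmax(axis=1)                 # nearest gallery item per query
    return float((pred == np.asarray(gt_idx)).mean())
```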

Conditioning on a random (incorrect) class drops SteerViT's accuracy by 47.7 points, confirming steerability is genuinely text-driven (CLIP/SigLIP show zero sensitivity).

GeneCIS (real-world conditional retrieval)

On the Focus Object split: SteerViT achieves 25.4% R@1 vs. 9.6% for DINOv2 and 18.7% for the specialized baseline.

MOSAIC (targeted attention)

4-image mosaics from PASCAL-VOC. DINOv2 PR-AUC: 14.3%; SteerViT: 50.2%.

Representation quality preservation

Linear probing on ImageWoof, Waterbirds, StanfordCars + binary segmentation on ADE20k. SteerViT fully preserves DINOv2’s representation quality (both score ~87.7 FG-CLS). At $\omega = 0.6$, both slightly exceed the original ViT’s quality. For MAE, quality monotonically improves with $\omega$ (40→50 points).

PODS (personalized object discrimination)

Text specificity controls feature granularity. Coarse prompts (“mug”): 27.9% PR-AUC. Descriptive MLLM-generated prompts: 58.1% PR-AUC (surpassing task-specific fine-tuned DINOv2 at 48.0%). Single SteerViT model replaces 100 task-specific fine-tuned models.

Zero-shot anomaly segmentation (MVTec AD / VisA)

No anomaly-specific training. Conditioned on “the anomaly in the <object>.” Ensemble of 10 prompts, averaged heatmaps.

Method              MVTec PRO   VisA PRO
SAM3                     54.5       65.9
WinCLIP                  64.6       59.8
FADE (specialist)        84.5       79.3
SteerViT                 82.1       82.0
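The prompt-ensemble procedure described above (average heatmaps across the 10 prompts, then upsample patch maps to pixel resolution) can be sketched as follows; nearest-neighbor upsampling is an assumption:

```python
import numpy as np

def ensemble_heatmap(heatmaps, patch=14):
    """Average per-prompt patch heatmaps, then upsample to pixel resolution.

    heatmaps: list of (n, n) anomaly maps, one per prompt variant.
    Returns an (n*patch, n*patch) pixel-level map (nearest-neighbor repeat).
    """
    avg = np.mean(np.stack(heatmaps), axis=0)        # (n, n) ensemble average
    return np.kron(avg, np.ones((patch, patch)))     # repeat each cell patch x patch
```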

Backbone generalization

Backbone   Base CORE   Late Fusion CORE   SteerViT CORE
DINOv2          43.7               93.3            96.0
SigLIP          38.3               75.4            91.3
MAE             21.8               41.0            74.9

Early fusion gains are largest for weaker backbones: +33.9 for MAE, +15.9 for SigLIP, +2.7 for DINOv2.

Reproduction Guide

Requirements.

  • PyTorch with CUDA (H100 recommended; ~84 GPU-hours for full training)
  • Pretrained weights: DINOv2 ViT-B/14, RoBERTa-Large
  • Training datasets: RefCOCO/+/g, Visual Genome (MDETR-preprocessed), LVIS, Mapillary Vistas

Key training configuration.

# Hyperparameters
resolution = 336
batch_size = 12
iterations = 500_000
optimizer = "AdamW"
lr_start = 3e-4          # after 5k warmup steps
lr_end = 3e-5            # reached at 40k steps
scheduler = "cosine"     # constant after 40k steps
ca_insertion = "every_other_block"  # 6 CA layers for ViT-B
gate_init = 0.0          # tanh(alpha) starts at identity
ffn = False              # omit Flamingo-style FFN
text_projector = "2-layer-MLP"

Segmentation mask preparation.

  • Convert ground-truth bounding boxes to binary masks using SAM2
  • Project pixel masks onto the ViT’s $n \times n$ patch grid (at 336px, ViT-B/14 gives $24 \times 24$)
  • Soft targets: fraction of foreground pixels per patch
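The soft-target construction in the last bullet amounts to average pooling of the binary mask over the patch grid:

```python
import numpy as np

def patch_soft_targets(mask, patch=14):
    """Per-patch foreground-pixel fraction.

    mask: (H, W) binary array with H, W divisible by patch
    (at 336px and patch 14 this yields a 24x24 target grid).
    """
    H, W = mask.shape
    grid = mask.reshape(H // patch, patch, W // patch, patch)
    return grid.mean(axis=(1, 3))  # (H/patch, W/patch), values in [0, 1]
```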

Inference with gate scaling.

# Scale all alpha values by omega for steerability-quality tradeoff
omega = 0.6  # optimal for DINOv2/SigLIP
for ca_layer in cross_attention_layers:
    ca_layer.alpha_scaled = ca_layer.alpha * omega

Evaluation commands (inferred from paper).

# CORE benchmark: encode images with text prompts, compute top-1 retrieval accuracy
# GeneCIS: evaluate on Focus Object split with R@1, R@2, R@3
# PODS: compute PR-AUC and NDCG on frozen features with descriptive prompts
# MOSAIC: stitch PASCAL-VOC images, compute PR-AUC of CLS-to-patch attention vs. GT masks
# Anomaly segmentation (MVTec): prompt with "the anomaly in the <object>", ensemble 10 prompts, upsample patch heatmaps, compute PRO/ROC/F1_max

Project page: jonaruthardt.github.io/project/SteerViT

Notes

Connection to Mechanistic Interpretability and Concept Vectors

SteerViT is a direct operationalization of the idea that text can serve as a steering vector for visual representations — closely paralleling work on contrastive activation addition and concept vectors in LLM interpretability. Key connections:

  1. Concept vectors via text. The gated cross-attention layers effectively compute text-conditioned perturbations to the ViT’s residual stream: $\Delta Z_v^{(\ell)} = \tanh(\alpha_\ell) \cdot \hat{Z}_v^{(\ell)}$. This is structurally analogous to activation steering in LLMs, where a concept vector is added to intermediate activations. The gate $\alpha_\ell$ plays the role of a steering coefficient that can be tuned at inference.

  2. Zero-init as identity preservation. Initializing $\alpha_\ell = 0$ ensures the perturbation starts as the zero vector and is gradually learned — a safety property analogous to activation addition starting from zero magnitude. The gradient $\mathrm{sech}^2(0) = 1$ guarantees the gate receives learning signal from the start.

  3. Early fusion vs. late fusion is critical. The paper demonstrates empirically that post-hoc feature manipulation (late fusion, adding text embeddings to frozen visual features) provides negligible steerability (+0.02% for CLIP). This mirrors findings in mechanistic interpretability that adding concept vectors to final-layer representations is less effective than intervening at intermediate layers. Early fusion (injecting text at layers 1, 3, 5, 7, 9, 11) allows steering to propagate through the network’s processing hierarchy.

  4. Embedding space reorganization. The UMAP visualizations (Fig. 9) show SteerViT can restructure the embedding topology along semantic hierarchies (“animal” merges animal classes), specific categories (“bird” separates birds), and compositional attributes (“eye” groups all animate classes). This is essentially attribute-level activation steering applied to vision — conditioning on a concept reshapes the representation manifold.

  5. Steerability is text-dependent, not artifact. The random-prompt ablation (Tab. 7) shows SteerViT drops 47.7 points when given incorrect text, while CLIP/SigLIP show zero sensitivity. This is the visual analog of showing that steering vectors have direction-dependent effects in LLMs.

  6. Scalability and efficiency. At 21.2M parameters (vs. $\geq$1B for MLLMs), SteerViT demonstrates that lightweight cross-attention adapters are sufficient for rich vision-language steering — suggesting similar adapter-based architectures could enable efficient concept steering in other modalities.

Key Limitations

  • Requires paired image-text data for the referential segmentation pretext task (2.28M pairs from existing datasets, not zero-shot).
  • Best results on DINOv2 backbone; MAE and SigLIP start from weaker baselines.
  • Anomaly segmentation cannot predict certain defect types (e.g., a flipped metal nut) with the zero-shot approach.
  • No code released as of publication.