Feature-Gradient Attribution: Enhancing Vision Transformer Explanations with Sparse Autoencoders

TU Wien
Master's Thesis 2025
Defense scheduled for March 2026 — minor revisions may occur
Advisors: Prof. Thomas Lukasiewicz, Dr. Bayar Ilhan Menzat

Abstract

Vision Transformers achieve strong performance in medical and natural imaging, yet their decision processes remain opaque. TransMM, a leading attribution method, combines attention with gradients to highlight influential patches. We introduce Feature-Gradient Attribution, which extends TransMM's principle from attention space to semantic feature space using Sparse Autoencoders (SAEs). SAEs decompose activations into interpretable features; we project gradients onto this feature basis to compute feature-gradient scores capturing both feature presence and influence on predictions. These scores modulate TransMM's attention maps, forming a lightweight, semantically informed correction.

Across three datasets (chest X-rays, endoscopy, natural images), two architectures (fine-tuned ViT-B/16, CLIP ViT-B/32), and three complementary faithfulness metrics, our method achieves consistent improvements: 10.5–44.3% on SaCo, 14.0–43.0% on Faithfulness Correlation, and 1.3–10.8% on Pixel Flipping. Ablation studies confirm that both feature activations and gradients are necessary. To our knowledge, this is the first integration of sparse semantic features during attribution computation, demonstrating that mechanistic feature structure can materially enhance Transformer attributions.

Method Overview

Standard attribution methods like TransMM identify important patches by combining attention values with attention gradients. We extend this principle from attention space to semantic feature space.

Instead of asking only "where is the model looking?", we project gradients through a Sparse Autoencoder (SAE) to ask "which semantic concepts are influencing the prediction?". We compute a Feature-Gradient Score by multiplying each feature's activation by its gradient influence on the prediction.
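As a concrete illustration, the feature-gradient score for one token can be sketched as follows. This is a minimal NumPy sketch, not the thesis implementation: the SAE weights (`W_enc`, `W_dec`) and dimensions are toy placeholders, and it assumes the gradient is projected into the feature basis through the decoder transpose (chain rule through the SAE reconstruction).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32                       # toy sizes; real SAEs are far wider
W_enc = rng.normal(size=(d_feat, d_model))    # hypothetical SAE encoder weights
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_model, d_feat))    # hypothetical SAE decoder weights

def feature_gradient_score(x, grad_x):
    """Score s = f * (dy/df) for one token.

    f     : sparse feature activations, ReLU(W_enc x + b_enc)
    dy/df : residual-stream gradient projected through the decoder,
            i.e. W_dec^T (dy/dx), via the chain rule on x_hat = W_dec f.
    """
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    grad_f = W_dec.T @ grad_x
    return f * grad_f

x = rng.normal(size=d_model)              # one token's residual activation
g = rng.normal(size=d_model)              # gradient of the target logit w.r.t. x
scores = feature_gradient_score(x, g)     # one score per SAE feature
```

Because the score is a product, it is nonzero only where a feature both fires (presence) and has gradient signal (influence).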

Standard TransMM

\( A \odot \frac{\partial y}{\partial A} \)

Attention Value × Gradient

Our Method

\( \bar{f}_t = f_t \odot \frac{\partial y}{\partial f_t} \)

Feature Activation × Gradient


This semantic score creates a gate that modulates the attribution at specific layers:

$$ \text{Attr}_{\ell} = \text{TransMM}_{\ell} \times \text{Gate}(\bar{\mathbf{f}}_t) $$

The resulting gates amplify patches where relevant semantic features are both present (high activation) and influential (high gradient).
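The gating step can be sketched in the same spirit. The exact mapping from per-feature scores to a multiplicative gate is our illustrative assumption here (sum of positive scores per token, min-max normalised into [1, 2]); the thesis's normalisation may differ.

```python
import numpy as np

def gate_from_scores(feat_grad_scores, eps=1e-8):
    """Collapse per-feature scores (tokens x features) to one gate per token.

    Illustrative choice: keep only positive (presence-and-influence) scores,
    sum them per token, and min-max normalise into [1, 2] so that
    semantically relevant tokens are amplified and others pass through.
    """
    s = np.maximum(feat_grad_scores, 0.0).sum(axis=-1)   # (tokens,)
    s = (s - s.min()) / (s.max() - s.min() + eps)
    return 1.0 + s

def gated_attribution(transmm_map, feat_grad_scores):
    """Attr_l = TransMM_l x Gate(f_bar_t), applied token-wise."""
    return transmm_map * gate_from_scores(feat_grad_scores)
```

Tokens whose features are both present and influential receive a gate near 2, doubling their attribution; tokens with no positive feature-gradient signal keep a gate of 1 and pass through unchanged.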

Feature-Gradient Attribution Method Overview
Figure 1: System architecture. We extract residual stream activations, decompose them into sparse features using an SAE, and project gradients into this feature basis. The resulting scores modulate TransMM's attention maps before relevancy propagation.

Quantitative Results

We evaluated our method on three diverse datasets using three complementary faithfulness metrics. The SAE-gated attributions consistently outperform both the baseline (Vanilla TransMM) and randomized controls, with statistically significant gains (\(p < 0.001\)) on most metrics.

Dataset                     Method            SaCo ↑           Faithfulness Corr. ↑   Pixel Flipping ↓
COVID-QU-Ex (Chest X-Ray)   Vanilla TransMM   0.459            0.309                  93.17
                            Ours              0.507 (+10.5%)   0.442 (+43.0%)         83.06 (-10.8%)
Hyperkvasir (Endoscopy)     Vanilla TransMM   0.504            0.479                  94.50
                            Ours              0.727 (+44.3%)   0.550 (+14.8%)         93.31 (-1.3%)
ImageNet (Natural)          Vanilla TransMM   0.250            0.314                  8.83
                            Ours              0.337 (+34.8%)   0.358 (+14.0%)         8.52 (-3.5%)

↑ Higher is better; ↓ lower is better. All comparisons are against the Vanilla TransMM baseline on test sets.
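To make the evaluation protocol concrete, here is a minimal sketch of a Pixel Flipping style curve. The `predict` callable, the zero baseline value, and the step count are illustrative assumptions, not the thesis's exact protocol: pixels are perturbed in decreasing attribution order, and a faithful attribution produces a faster score drop.

```python
import numpy as np

def pixel_flipping_curve(predict, image, attribution, steps=10):
    """Zero out pixels in decreasing attribution order and record the
    model score after each step; a faster drop (lower area under the
    curve) indicates a more faithful attribution."""
    order = np.argsort(attribution.ravel())[::-1]   # most important first
    flipped = image.copy().ravel()
    curve = [predict(flipped.reshape(image.shape))]
    chunk = max(1, len(order) // steps)
    for i in range(0, len(order), chunk):
        flipped[order[i:i + chunk]] = 0.0           # illustrative zero baseline
        curve.append(predict(flipped.reshape(image.shape)))
    return np.array(curve)
```

With a toy model whose score is just the pixel sum, a perfectly faithful attribution (the image itself) yields a monotonically decreasing curve.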

Feature-Gradient Attribution Visualizations

To understand how our method improves faithfulness, we isolated the 100 images with the highest composite improvement across all three metrics (SaCo, Faithfulness Correlation, and Pixel Flipping). For each, we identified the patches with the largest attribution-magnitude change (|ΔAttribution|) between the baseline and gated attributions, and extracted the SAE feature contributing most to each change.
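The selection of those images can be sketched as a composite ranking. The dictionary keys, the relative-improvement formula, and the equal weighting are our assumptions for illustration; Pixel Flipping's sign is flipped because lower is better.

```python
import numpy as np

def composite_improvement(base, ours):
    """Mean relative improvement per image across the three metrics.

    base, ours: dicts mapping metric name -> per-image score array
    (the key names here are hypothetical, not the thesis's identifiers).
    """
    eps = 1e-8
    rel = [
        (ours["saco"] - base["saco"]) / (np.abs(base["saco"]) + eps),
        (ours["faith_corr"] - base["faith_corr"]) / (np.abs(base["faith_corr"]) + eps),
        # lower Pixel Flipping is better, so the improvement is base - ours
        (base["pixel_flip"] - ours["pixel_flip"]) / (np.abs(base["pixel_flip"]) + eps),
    ]
    return np.mean(rel, axis=0)

def top_k_images(base, ours, k=100):
    """Indices of the k images with the largest composite improvement."""
    return np.argsort(composite_improvement(base, ours))[::-1][:k]
```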

While individual SAE feature interpretability remains challenging, with many features exhibiting polysemanticity or abstract activation patterns, our analysis revealed distinct behavioral regimes. Examining prototypical activations (the top-10 activating validation images) for these features shows how the gating mechanism refines attributions.

The gating improves attributions through four primary strategies:

  • Suppression of Confounders: Active suppression of high-salience concepts that co-occur with target classes but lack causal relevance (e.g., faces in clothing datasets).
  • Artifact Removal: Filtering non-semantic data artifacts like watermarks, text overlays, or spurious background color biases.
  • Context-Dependent Modulation: Sophisticated context-switching behavior where features act as boosters or suppressors depending on semantic context (e.g., fur texture defining vs. crowding out other features).
  • Broadband Noise Filtering: Late-stage denoising by suppressing low-information patches containing blur, high-frequency edges, or generic visual clutter.

Explore the feature case studies below:

Understanding the Visualizations

Each case study shows attribution comparisons in a three-column format:

  • Left: Original input image
  • Center: Baseline TransMM attribution (without feature gating)
  • Right: Our feature-gated attribution

Red boxes highlight patches where the identified SAE feature drives suppression (reducing attribution), while green boxes indicate amplification (boosting attribution). These show the largest attribution changes that improve faithfulness.

Below each case study grid, prototype analysis shows the top-10 validation images where the feature activates most strongly, with yellow boxes indicating the specific patches within each image that trigger the feature response. This reveals what visual patterns the feature detects.

BibTeX

@mastersthesis{Sula2025FeatureGradient,
  title={Feature-Gradient Attribution: Enhancing Vision Transformer Explanations with Sparse Autoencoders},
  author={Šula, Julius},
  school={TU Wien},
  year={2025},
  url={https://piragi.github.io/thesis}
}