Feature-Gradient Attribution: Enhancing Vision Transformer Explanations with Sparse Autoencoders

TU Wien
Master's Thesis 2025
Defense scheduled for March 2026 — minor revisions may occur
Advisors: Prof. Thomas Lukasiewicz, Dr. Bayar Ilhan Menzat

Abstract

Vision Transformers achieve strong performance in medical and natural imaging, yet their decision processes remain opaque. TransMM, a leading attribution method, combines attention with gradients to highlight influential patches. We introduce Feature-Gradient Attribution, which extends TransMM's principle from attention space to semantic feature space using Sparse Autoencoders (SAEs). SAEs decompose activations into interpretable features; we project gradients onto this feature basis to compute feature-gradient scores capturing both feature presence and influence on predictions. These scores modulate TransMM's attention maps, forming a lightweight, semantically informed correction.

Across three datasets (chest X-rays, endoscopy, natural images), two architectures (fine-tuned ViT-B/16, CLIP ViT-B/32), and three complementary faithfulness metrics, our method achieves consistent improvements: 10.5–44.3% on SaCo, 14.0–43.0% on Faithfulness Correlation, and 1.3–10.8% on Pixel Flipping. Ablation studies confirm that both feature activations and gradients are necessary. To our knowledge, this is the first integration of sparse semantic features during attribution computation, demonstrating that mechanistic feature structure can materially enhance Transformer attributions.

Method Overview

Standard attribution methods like TransMM identify important patches by combining attention values with attention gradients. We extend this principle from attention space to semantic feature space.

Instead of asking only "where is the model looking?", we project gradients through a Sparse Autoencoder (SAE) to ask "which semantic concepts are influencing the prediction?". We compute a Feature-Gradient Score by multiplying each feature's activation by its gradient influence on the prediction.
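As a concrete illustration, the feature-gradient score for one token can be sketched as follows. This is a minimal NumPy sketch, not the thesis implementation: the SAE weights (`W_enc`, `W_dec`) and dimensions are toy placeholders, and it assumes the gradient is projected into the feature basis through the decoder transpose (chain rule through the SAE reconstruction).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32                       # toy sizes; real SAEs are far wider
W_enc = rng.normal(size=(d_feat, d_model))    # hypothetical SAE encoder weights
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_model, d_feat))    # hypothetical SAE decoder weights

def feature_gradient_score(x, grad_x):
    """Score s = f * (dy/df) for one token.

    f     : sparse feature activations, ReLU(W_enc x + b_enc)
    dy/df : residual-stream gradient projected through the decoder,
            i.e. W_dec^T (dy/dx), via the chain rule on x_hat = W_dec f.
    """
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    grad_f = W_dec.T @ grad_x
    return f * grad_f

x = rng.normal(size=d_model)              # one token's residual activation
g = rng.normal(size=d_model)              # gradient of the target logit w.r.t. x
scores = feature_gradient_score(x, g)     # one score per SAE feature
```

Because the score is a product, it is nonzero only where a feature both fires (presence) and has gradient signal (influence).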

Standard TransMM

\( A \odot \frac{\partial y}{\partial A} \)

Attention Value × Gradient

Our Method

\( \bar{f}_t = f_t \odot \frac{\partial y}{\partial f_t} \)

Feature Activation × Gradient


This semantic score creates a gate that modulates the attribution at specific layers:

$$ \text{Attr}_{\ell} = \text{TransMM}_{\ell} \times \text{Gate}(\bar{\mathbf{f}}_t) $$

The resulting gates amplify patches where relevant semantic features are both present (high activation) and influential (high gradient).
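The gating step can be sketched in the same spirit. The exact mapping from per-feature scores to a multiplicative gate is our illustrative assumption here (sum of positive scores per token, min-max normalised into [1, 2]); the thesis's normalisation may differ.

```python
import numpy as np

def gate_from_scores(feat_grad_scores, eps=1e-8):
    """Collapse per-feature scores (tokens x features) to one gate per token.

    Illustrative choice: keep only positive (presence-and-influence) scores,
    sum them per token, and min-max normalise into [1, 2] so that
    semantically relevant tokens are amplified and others pass through.
    """
    s = np.maximum(feat_grad_scores, 0.0).sum(axis=-1)   # (tokens,)
    s = (s - s.min()) / (s.max() - s.min() + eps)
    return 1.0 + s

def gated_attribution(transmm_map, feat_grad_scores):
    """Attr_l = TransMM_l x Gate(f_bar_t), applied token-wise."""
    return transmm_map * gate_from_scores(feat_grad_scores)
```

Tokens whose features are both present and influential receive a gate near 2, doubling their attribution; tokens with no positive feature-gradient signal keep a gate of 1 and pass through unchanged.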

Feature-Gradient Attribution Method Overview
Figure 1: System architecture. We extract residual stream activations, decompose them into sparse features using an SAE, and project gradients into this feature basis. The resulting scores modulate TransMM's attention maps before relevancy propagation.

Quantitative Results

We evaluated our method on three diverse datasets using three complementary faithfulness metrics. The SAE-gated attributions consistently outperform both the baseline (Vanilla TransMM) and randomized controls, with statistically significant gains (\(p < 0.001\)) on most metrics.

Dataset                     Method            SaCo ↑           Faithfulness Corr. ↑   Pixel Flipping ↓
COVID-QU-Ex (Chest X-Ray)   Vanilla TransMM   0.459            0.309                  93.17
                            Ours              0.507 (+10.5%)   0.442 (+43.0%)         83.06 (-10.8%)
Hyperkvasir (Endoscopy)     Vanilla TransMM   0.504            0.479                  94.50
                            Ours              0.727 (+44.3%)   0.550 (+14.8%)         93.31 (-1.3%)
ImageNet (Natural)          Vanilla TransMM   0.250            0.314                  8.83
                            Ours              0.337 (+34.8%)   0.358 (+14.0%)         8.52 (-3.5%)

↑ Higher is better; ↓ lower is better. All comparisons are against the Vanilla TransMM baseline on test sets.
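To make the evaluation protocol concrete, here is a minimal sketch of a Pixel Flipping style curve. The `predict` callable, the zero baseline value, and the step count are illustrative assumptions, not the thesis's exact protocol: pixels are perturbed in decreasing attribution order, and a faithful attribution produces a faster score drop.

```python
import numpy as np

def pixel_flipping_curve(predict, image, attribution, steps=10):
    """Zero out pixels in decreasing attribution order and record the
    model score after each step; a faster drop (lower area under the
    curve) indicates a more faithful attribution."""
    order = np.argsort(attribution.ravel())[::-1]   # most important first
    flipped = image.copy().ravel()
    curve = [predict(flipped.reshape(image.shape))]
    chunk = max(1, len(order) // steps)
    for i in range(0, len(order), chunk):
        flipped[order[i:i + chunk]] = 0.0           # illustrative zero baseline
        curve.append(predict(flipped.reshape(image.shape)))
    return np.array(curve)
```

With a toy model whose score is just the pixel sum, a perfectly faithful attribution (the image itself) yields a monotonically decreasing curve.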

Feature-Gradient Attribution Visualizations

To understand how our method improves faithfulness, we isolated the 100 images with the highest composite improvement across all three metrics (SaCo, Faithfulness Correlation, and Pixel Flipping). For each, we identified the patches with the largest attribution-magnitude change (|ΔAttribution|) between the baseline and gated attributions, and extracted the SAE feature contributing most to each change.
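The selection of those images can be sketched as a composite ranking. The dictionary keys, the relative-improvement formula, and the equal weighting are our assumptions for illustration; Pixel Flipping's sign is flipped because lower is better.

```python
import numpy as np

def composite_improvement(base, ours):
    """Mean relative improvement per image across the three metrics.

    base, ours: dicts mapping metric name -> per-image score array
    (the key names here are hypothetical, not the thesis's identifiers).
    """
    eps = 1e-8
    rel = [
        (ours["saco"] - base["saco"]) / (np.abs(base["saco"]) + eps),
        (ours["faith_corr"] - base["faith_corr"]) / (np.abs(base["faith_corr"]) + eps),
        # lower Pixel Flipping is better, so the improvement is base - ours
        (base["pixel_flip"] - ours["pixel_flip"]) / (np.abs(base["pixel_flip"]) + eps),
    ]
    return np.mean(rel, axis=0)

def top_k_images(base, ours, k=100):
    """Indices of the k images with the largest composite improvement."""
    return np.argsort(composite_improvement(base, ours))[::-1][:k]
```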

While individual SAE feature interpretability remains challenging, with many features exhibiting polysemanticity or abstract activation patterns, our analysis revealed distinct behavioral regimes. Examining prototypical activations (the top-10 activating validation images) for these features shows how the gating mechanism refines attributions.

The gating improves attributions through four primary strategies:

  • Suppression of Confounders: Active suppression of high-salience concepts that co-occur with target classes but lack causal relevance (e.g., faces in clothing datasets).
  • Artifact Removal: Filtering non-semantic data artifacts like watermarks, text overlays, or spurious background color biases.
  • Context-Dependent Modulation: Sophisticated context-switching behavior where features act as boosters or suppressors depending on semantic context (e.g., fur texture defining vs. crowding out other features).
  • Broadband Noise Filtering: Late-stage denoising by suppressing low-information patches containing blur, high-frequency edges, or generic visual clutter.

Explore the feature case studies below:

Understanding the Visualizations

Each case study shows attribution comparisons in a three-column format:

  • Left: Original input image
  • Center: Baseline TransMM attribution (without feature gating)
  • Right: Our feature-gated attribution

Red boxes highlight patches where the identified SAE feature drives suppression (reducing attribution), while green boxes indicate amplification (boosting attribution). These show the largest attribution changes that improve faithfulness.

Below each case study grid, prototype analysis shows the top-10 validation images where the feature activates most strongly, with yellow boxes indicating the specific patches within each image that trigger the feature response. This reveals what visual patterns the feature detects.

BibTeX

@mastersthesis{Sula2025FeatureGradient,
  title={Feature-Gradient Attribution: Enhancing Vision Transformer Explanations with Sparse Autoencoders},
  author={Šula, Julius},
  school={TU Wien},
  year={2025},
  url={https://piragi.github.io/thesis}
}