paper · 2026

QKV Decomposition for Transformer XAI

Diagnose transformer prediction failures from weights alone and correct them by retraining a single layer. GPT-2 capital-city accuracy 2/8 → 8/8, zero side effects, achievable even with a V-only Wv slice (590K params).

What this paper does

We present a method for diagnosing transformer prediction errors and surgically correcting them through Q/K/V weight analysis. By decomposing each attention head's weights into query functions, key responses, and value channels, the method identifies failure causes without running any input through the model.
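For concreteness, here is a minimal sketch of the weight-level decomposition on HuggingFace GPT-2. The `head_qkv` helper is illustrative, not the paper's released code; it only slices the fused `c_attn` weight into per-head W_q / W_k / W_v blocks:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
D, H = model.config.n_embd, model.config.n_head   # 768, 12
d_head = D // H                                   # 64

def head_qkv(layer: int, head: int):
    """Slice W_q, W_k, W_v for one attention head, straight from the weights."""
    # c_attn is a Conv1D whose (768, 2304) weight is [W_q | W_k | W_v] concatenated.
    W = model.transformer.h[layer].attn.c_attn.weight
    s = slice(head * d_head, (head + 1) * d_head)
    return W[:, :D][:, s], W[:, D:2 * D][:, s], W[:, 2 * D:][:, s]

Wq, Wk, Wv = head_qkv(layer=9, head=8)  # the paper's L10 H8, if layers are 1-indexed
```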

Applied to GPT-2: factual knowledge (e.g., France→Paris) emerges at layer 10 head 8 (+25.8 logit contribution) and is subsequently reversed at layer 12 head 0 (+149.9 for “the”). Targeted retraining of only the diagnosed layer recovers knowledge accuracy from 2/8 to 8/8 capitals with zero side effects (general capability 11/15 maintained, PPL 42.7 → 42.6).

Figure: layer-by-layer cumulative logit for Paris vs. "the". Knowledge path: Paris emerges at L10 H8 and is reversed at L12 H0.

The model already possesses this knowledge internally; the failure is one of routing, not absence.
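The trace behind the figure can be reproduced with standard direct logit attribution. A sketch assuming TransformerLens, with an illustrative prompt; the final LayerNorm scale is ignored here, which is fine for a qualitative trace:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
paris = model.to_single_token(" Paris")
the = model.to_single_token(" the")

logits, cache = model.run_with_cache("The capital of France is")
# Every head's contribution to the residual stream: (n_layers*n_heads, batch, pos, d_model)
head_out = cache.stack_head_results(layer=-1)
direction = model.W_U[:, paris] - model.W_U[:, the]   # "Paris minus the" logit direction
contrib = (head_out[:, 0, -1, :] @ direction).reshape(model.cfg.n_layers, model.cfg.n_heads)
# With the paper's 1-indexed layers, L10 H8 is contrib[9, 8] and L12 H0 is contrib[11, 0].
```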

Figure: before and after L10-only retraining. 2/8 → 8/8 capitals, zero side effects.

Why it matters

Routing can be corrected through attention V, through FFN, or even through a V-only Wv slice (590K parameters, 0.5% of GPT-2). Knowledge routing is therefore not confined to FFN layers, contrary to the common interpretation behind ROME and similar methods.
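A minimal sketch of what V-only retraining can look like, assuming HuggingFace GPT-2 and a gradient mask over the fused QKV weight; the layer index and hyperparameters are placeholders, not the paper's training script:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
D = model.config.n_embd                       # 768
LAYER = 9                                     # the paper's L10, if layers are 1-indexed

for p in model.parameters():                  # freeze the whole model...
    p.requires_grad = False
c_attn = model.transformer.h[LAYER].attn.c_attn
c_attn.weight.requires_grad = True            # ...except the fused QKV weight

mask = torch.zeros_like(c_attn.weight)        # (768, 2304) = [W_q | W_k | W_v]
mask[:, 2 * D:] = 1.0                         # let gradients through only on the W_v slice
c_attn.weight.register_hook(lambda g: g * mask)

print(int(mask.sum()))                        # 589,824 trainable values ≈ 590K
# weight_decay=0 so the masked (frozen) columns are not decayed by AdamW
optimizer = torch.optim.AdamW([c_attn.weight], lr=1e-4, weight_decay=0.0)
# ...then run an ordinary LM fine-tuning loop on the correction prompts.
```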

Figure: all four training configurations reach 8/8 capitals correct; V-only (590K params) suffices.

This opens correction pathways beyond MLP-only model editing.

Static diagnostics (input-free)

The method also classifies all 144 attention heads in GPT-2 small directly from weights — no input pass required:

Figure: QK overlap heatmap (top-100 dimension intersection) across all 144 heads. High values indicate self-similar matching; low values, heterogeneous matching.
Figure: Wk response (K-norm) by token type across layers. Proper nouns draw the highest response at every layer, yet the model fails to retrieve facts about them; the bottleneck is on the Q side.
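One plausible reading of these two statistics, computed from the weights (plus the token embedding matrix for K-norm) with no forward pass; the exact reductions the paper uses may differ:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
D, H = model.config.n_embd, model.config.n_head
d_head = D // H

def head_slices(layer: int, head: int):
    W = model.transformer.h[layer].attn.c_attn.weight     # (768, 2304) = [W_q|W_k|W_v]
    s = slice(head * d_head, (head + 1) * d_head)
    return W[:, :D][:, s], W[:, D:2 * D][:, s], W[:, 2 * D:][:, s]

def qk_overlap(layer: int, head: int, k: int = 100) -> float:
    """Fraction of top-k embedding dimensions shared by W_q and W_k."""
    Wq, Wk, _ = head_slices(layer, head)
    top_q = set(Wq.abs().sum(dim=1).topk(k).indices.tolist())
    top_k = set(Wk.abs().sum(dim=1).topk(k).indices.tolist())
    return len(top_q & top_k) / k                         # 1.0 = self-similar matching

def k_norm(layer: int, head: int, token_ids) -> float:
    """Mean ||W_k^T e_t|| over token embeddings: a weight-only 'K response'."""
    E = model.transformer.wte.weight                      # (50257, 768)
    _, Wk, _ = head_slices(layer, head)
    return (E[token_ids] @ Wk).norm(dim=1).mean().item()
```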

Cross-validation

  • Captum (gradient × activation): top-1 neuron agreement (n=2440)
  • TransformerLens logit lens: same routing layer identified
  • Activation patching: confirms causal involvement (-0.44 logit when the diagnosed neuron is zeroed; sketched below)
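The zero-ablation check might look like the following, using a plain PyTorch forward hook on GPT-2's MLP. The layer and neuron indices are placeholders for the diagnosed ones:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, NEURON = 9, 0                 # placeholders: substitute the diagnosed indices
ids = tok("The capital of France is", return_tensors="pt").input_ids
paris = tok.encode(" Paris")[0]

def zero_neuron(module, inputs, output):
    out = output.clone()
    out[..., NEURON] = 0.0           # pre-GELU zero => the neuron outputs GELU(0) = 0
    return out

with torch.no_grad():
    base = model(ids).logits[0, -1, paris].item()
    hook = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(zero_neuron)
    ablated = model(ids).logits[0, -1, paris].item()
    hook.remove()

print(f"logit change for ' Paris': {ablated - base:+.2f}")   # paper reports -0.44
```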

Verify

  • Zenodo — paper PDF + permanent DOI for citation.
  • HuggingFace dashboard — corrected GPT-2 L10 weights, QKV analysis scripts, interactive dashboard, all figures. Anyone can re-run the 8 capital-city prompts and check (a minimal version of that check is sketched below).
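A minimal version of that check, with an illustrative 8-country list; the paper's exact prompt set ships with the dashboard:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # swap in the corrected L10 weights

pairs = [("France", "Paris"), ("Japan", "Tokyo"), ("Italy", "Rome"),
         ("Spain", "Madrid"), ("Germany", "Berlin"), ("Russia", "Moscow"),
         ("Canada", "Ottawa"), ("Egypt", "Cairo")]     # illustrative 8-country set

correct = 0
with torch.no_grad():
    for country, capital in pairs:
        ids = tok(f"The capital of {country} is", return_tensors="pt").input_ids
        pred = model(ids).logits[0, -1].argmax().item()
        correct += pred == tok.encode(" " + capital)[0]  # compare first predicted token

print(f"{correct}/8 capitals correct")
```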

arXiv preprint: forthcoming.