paper · 2026
QKV Decomposition for Transformer XAI
Diagnose transformer prediction failures from weights alone, then correct them by retraining one layer. GPT-2 capital-city accuracy rises from 2/8 to 8/8 with zero side effects, and the fix works through any of attention, FFN, or a V-only slice (590K params).
What this paper does
We present a method for diagnosing transformer prediction errors and surgically correcting them through Q/K/V weight analysis. By decomposing each attention head's weights into query functions, key responses, and value channels, we can identify failure causes without running a single input through the model.
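The decomposition itself is a matter of slicing the model's weight matrices. A minimal sketch, assuming GPT-2 small's dimensions and its fused attention layout (in Hugging Face GPT-2, `c_attn.weight` stacks Q, K, and V along the output axis); random values stand in for the real weights so the example is self-contained:

```python
import numpy as np

# GPT-2 small dimensions (actual model values).
D_MODEL, N_HEADS, HEAD_DIM = 768, 12, 64

# Stand-in for one layer's fused attention weight, shape (d_model, 3*d_model).
# In Hugging Face GPT-2 this is `block.attn.c_attn.weight`; random values
# keep the sketch runnable without downloading the model.
rng = np.random.default_rng(0)
W_attn = rng.standard_normal((D_MODEL, 3 * D_MODEL))

# Split the fused matrix into the Q, K, and V projections.
W_q, W_k, W_v = np.split(W_attn, 3, axis=1)

def head_slice(W, head):
    """Return the (d_model, head_dim) weight slice for one attention head."""
    return W[:, head * HEAD_DIM : (head + 1) * HEAD_DIM]

# e.g. the value channel of head 8 (the layer-10 head examined in the paper)
W_v_h8 = head_slice(W_v, 8)
print(W_v_h8.shape)  # (768, 64)
```

Each per-head slice can then be analyzed in isolation, which is what makes the diagnostics input-free.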
Applied to GPT-2: factual knowledge (e.g., France→Paris) emerges at layer 10 head 8 (+25.8 logit contribution) and is then overridden at layer 12 head 0, which promotes the generic token “the” instead (+149.9). Retraining only the diagnosed layer recovers knowledge accuracy from 2/8 to 8/8 capitals with zero side effects: general capability holds at 11/15 and perplexity is unchanged (42.7 → 42.6).
The model already possesses this knowledge internally; the failure is one of routing, not absence.
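Mechanically, "retraining only the diagnosed layer" is parameter freezing. A hedged sketch with a hypothetical toy model (`TinyLM` is not the paper's code; on real GPT-2 you would freeze everything except `model.transformer.h[10]`):

```python
import torch
from torch import nn

# Hypothetical stand-in model: embedding -> a few layers -> output head.
class TinyLM(nn.Module):
    def __init__(self, d=16, n_layers=3, vocab=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = torch.relu(layer(h))
        return self.head(h)

model = TinyLM()
TARGET = 1  # stands in for the diagnosed layer (layer 10 in the paper)

# Freeze everything, then re-enable gradients for the target layer only.
for p in model.parameters():
    p.requires_grad = False
for p in model.layers[TARGET].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)

# One illustrative update step on a dummy next-token pair.
ids = torch.tensor([[1, 2, 3]])
target_ids = torch.tensor([[2, 3, 4]])
loss = nn.functional.cross_entropy(
    model(ids).flatten(0, 1), target_ids.flatten()
)
loss.backward()
opt.step()
```

Because the optimizer only ever sees the diagnosed layer's parameters, the rest of the model is bitwise untouched, which is what makes the "zero side effects" claim checkable.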
Why it matters
Routing correction can be achieved through attention V, FFN, or even V-only (a Wv slice, 590K parameters, 0.5% of GPT-2). Knowledge routing is not confined to FFN layers, contrary to the FFN-centric assumption behind ROME and similar editing methods.
This opens correction pathways beyond MLP-only model editing.
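The parameter accounting behind the 590K figure is straightforward: with d_model = 768, one layer's full value projection is a 768×768 matrix. A quick check (the ~124M total is the commonly reported GPT-2 small size, used here as an approximation):

```python
# One layer's value projection W_v in GPT-2 small (d_model = 768).
D_MODEL = 768
wv_params = D_MODEL * D_MODEL

# Approximate GPT-2 small total parameter count (~124M).
gpt2_params = 124_000_000

print(wv_params)                                # 589824, i.e. ~590K
print(round(100 * wv_params / gpt2_params, 2))  # ~0.48, i.e. ~0.5% of GPT-2
```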
Static diagnostics (input-free)
The method also classifies all 144 attention heads in GPT-2 small directly from weights — no input pass required:
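The paper's actual classification criteria live in its released scripts; as an assumed illustration of what a weight-only head statistic looks like, one input-free signal is the effective rank of a head's composed QK circuit, computed purely from the weight slices:

```python
import numpy as np

D_MODEL, HEAD_DIM = 768, 64

def qk_effective_rank(W_q, W_k):
    """Entropy-based effective rank of the head's QK interaction matrix.

    W_q @ W_k.T is the bilinear form the head applies between token
    representations; its spectrum is a property of the weights alone.
    """
    s = np.linalg.svd(W_q @ W_k.T, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

# Random stand-ins for one head's real (d_model, head_dim) weight slices;
# on the real model, loop this over all 12 layers x 12 heads = 144 heads.
rng = np.random.default_rng(0)
W_q = rng.standard_normal((D_MODEL, HEAD_DIM))
W_k = rng.standard_normal((D_MODEL, HEAD_DIM))
score = qk_effective_rank(W_q, W_k)
print(score)  # one scalar per head, bounded by the head dimension
```

Statistics like this (spectra, norms, alignment between per-head slices) are what allow every head to be scored without a forward pass.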
Cross-validation
- Captum (gradient × activation): top-1 neuron agreement (n=2440)
- TransformerLens logit lens: same routing layer identified
- Activation patching: confirms causal involvement (the target logit drops by 0.44 when the diagnosed neuron is zeroed)
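The activation-patching check above can be sketched in a few lines. This toy version uses random stand-ins for the hidden state and unembedding (the shapes, neuron index, and token index are illustrative, not the paper's); the paper reports a −0.44 logit change for its diagnosed neuron:

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 16, 50

W_out = rng.standard_normal((D, VOCAB))  # stand-in unembedding matrix
h = rng.standard_normal(D)               # hidden state at the probed layer

NEURON, TARGET = 3, 7                    # hypothetical neuron / token indices

logits_clean = h @ W_out

h_patched = h.copy()
h_patched[NEURON] = 0.0                  # ablate the diagnosed neuron
logits_patched = h_patched @ W_out

# Causal effect of the neuron on the target token's logit; with a linear
# readout this equals -h[NEURON] * W_out[NEURON, TARGET].
delta = logits_patched[TARGET] - logits_clean[TARGET]
print(delta)
```

A negative delta on the correct-answer token confirms the neuron was contributing to the right prediction before the downstream routing overrode it.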
Verify
- Zenodo — paper PDF + permanent DOI for citation.
- HuggingFace dashboard — corrected GPT-2 L10 weights, QKV analysis scripts, interactive dashboard, all figures. Anyone can re-run the 8 capital-city prompts and check.
arXiv preprint: forthcoming.