Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

arXiv:2605.06076v1. Abstract: The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs): it identifies critical components via mechanistic interpretability and then applies targeted parameter updates to them. However, this paradigm rests on a fundamental yet unverified assumption, namely that mechanisms derived from the current, static parameters can reliably guide future, dynamic parameter updates. To investigate this, we systematically track the…

cs.CL updates on arXiv.org · May 8
