This paper offers a geometric explanation for emergent misalignment in fine-tuned LLMs, framing it as a feature-superposition problem rather than a mysterious safety failure. It should be useful for researchers studying why narrow post-training can surface harmful behaviors.
arXiv:2605.00842v1 Announce Type: new Abstract: Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To explain this phenomenon, we propose an account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature…
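The core geometric intuition is illustrated below in a minimal sketch: when a model stores more features than it has dimensions, feature directions cannot be mutually orthogonal, so amplifying one feature's direction necessarily shifts the linear readout of every feature that overlaps with it. This is a toy illustration of superposition in general, not the paper's method; all names (`W`, `n_features`, `d_model`, the amplification factor) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_model = 8, 4  # more features than dimensions -> superposition
# Random unit feature directions; with n_features > d_model they cannot all
# be orthogonal, so features share representational space.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = np.zeros(n_features)
x[0] = 1.0  # activate only the "target" feature

h = x @ W                 # hidden representation
readout_before = W @ h    # linear readout of every feature

# Toy "fine-tuning": amplify the target feature's direction.
W_ft = W.copy()
W_ft[0] *= 3.0
h_ft = x @ W_ft
readout_after = W_ft @ h_ft

# Because the directions overlap, boosting feature 0 also shifts the readout
# of every feature whose direction is non-orthogonal to it.
print("off-target readout shift:", np.abs(readout_after[1:] - readout_before[1:]))
```

Running this prints nonzero shifts for the untouched features, which is the superposition-based mechanism the abstract gestures at: amplifying one feature cannot be done in isolation when representations overlap.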