arXiv:2605.00236v1 Announce Type: cross Abstract: Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack (ARA), a white-box adversarial attack that identifies safety-critical attention heads and crafts non-semantic adversarial tokens that redirect attention away from safety-relevant positions. Unlike prior jailbreak…
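The abstract's first step, identifying safety-critical attention heads, can be illustrated with a minimal sketch. This is not the paper's implementation: the model ("gpt2" as a stand-in for a safety-aligned LLM), the choice of treating the prompt tokens as the "safety-relevant" positions, and the scoring rule (attention mass from the final query token) are all illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets safety-aligned LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

prompt = "Explain how to pick a lock."  # illustrative request
inputs = tokenizer(prompt, return_tensors="pt")

# Assumption: treat all instruction tokens as the "safety-relevant"
# key positions whose attention mass we measure.
safety_positions = list(range(inputs["input_ids"].shape[1]))

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer, each shaped
# (batch, num_heads, query_len, key_len).
scores = {}
for layer_idx, attn in enumerate(out.attentions):
    # Attention of the last query token over all key positions, per head.
    last_tok_attn = attn[0, :, -1, :]                       # (heads, key_len)
    mass = last_tok_attn[:, safety_positions].sum(dim=-1)   # (heads,)
    for head_idx, m in enumerate(mass.tolist()):
        scores[(layer_idx, head_idx)] = m

# Heads with the highest mass on safety-relevant positions are the
# candidates an attack like ARA would try to redirect elsewhere.
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
for (layer, head), m in top:
    print(f"layer {layer:2d} head {head:2d}  safety-attention mass = {m:.3f}")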