GenAI · This paper offers a geometric explanation for emergent misalignment in fine-tuned LLMs, framing it as a feature-superposition problem rather than a mysterious safety failure. It should be useful for researchers studying…
cs.AI updates on arXiv.org·May 6·Score 10.0
Agentic · This paper proposes a low-latency fraud-detection layer for spotting adversarial interaction patterns in LLM agents. It matters because agent defenses need to operate in real time, not just at the prompt-filtering stage.
cs.AI updates on arXiv.org·May 6·Score 9.9
Agentic · This position paper argues that multi-agent safety depends more on interaction topology than on the alignment or scale of the underlying models. For builders of agentic systems, it reframes safety as a systems-design…
cs.AI updates on arXiv.org·May 6·Score 9.3
Safety · This paper reframes AI safety around irreversibility, arguing that low-friction deployment changes the control problem more than raw capability does. It should interest safety researchers looking for a systems-level…
cs.AI updates on arXiv.org·May 6·Score 8.9
Agentic · This paper studies how a jailbreak can propagate across multi-agent systems and proposes a foresight-guided defense to stop the spread early. It matters for builders shipping agent swarms, where one compromised agent…
cs.AI updates on arXiv.org·May 6·Score 10.0

Physical AI · MIT highlights a training method that makes reasoning models better at expressing uncertainty without losing accuracy. For builders, that matters because calibrated confidence is a practical lever for reducing…
Tavily · Physical AI · Score 9.7