AI Safety

Alignment research, interpretability, red-teaming, and the empirical work behind safe deployment.

23 stories

Safety
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
arXiv:2605.05403v1. Abstract: This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. Existing work often operationalizes sycophancy…
cs.AI updates on arXiv.org
May 9
Score 7.0
Safety
Understanding Annotator Safety Policy with Interpretability
arXiv:2605.05329v1. Abstract: Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can…
cs.AI updates on arXiv.org
May 9
Score 7.0
Safety
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
arXiv:2605.05427v1. Abstract: As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations…
cs.AI updates on arXiv.org
May 9
Score 7.0
Safety
COPYCOP: Ownership Verification for Graph Neural Networks
arXiv:2605.05360v1. Abstract: Given two GNNs that output node embeddings, how can we determine if they were trained independently? An adversary could have trained one GNN specifically to mimic the…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
Information Theoretic Adversarial Training of Large Language Models
arXiv:2605.05415v1. Abstract: Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
Adversarial Graph Neural Network Benchmarks: Towards Practical and Fair Evaluation
arXiv:2605.05534v1. Abstract: Adversarial learning and the robustness of Graph Neural Networks (GNNs) are topics of widespread interest in the machine learning community, as documented by the number of…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
MidSteer: Optimal Affine Framework for Steering Generative Models
arXiv:2605.05220v1. Abstract: Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment alignment and safety settings.…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series
arXiv:2605.05524v1. Abstract: Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
Data-Driven Variational Basis Learning Beyond Neural Networks: A Non-Neural Framework for Adaptive Basis Discovery
arXiv:2605.05221v1. Abstract: Classical representation systems such as Fourier series, wavelets, and fixed dictionaries provide analytically tractable basis expansions, but they are not intrinsically…
cs.LG updates on arXiv.org
May 8
Score 7.0
Safety
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
arXiv:2605.05653v1. Abstract: Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic…
cs.CL updates on arXiv.org
May 8
Score 7.0
Safety
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
arXiv:2605.05950v1. Abstract: The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a…
cs.CL updates on arXiv.org
May 8
Score 7.0
Safety
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
arXiv:2605.05662v1. Abstract: Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a…
cs.CL updates on arXiv.org
May 8
Score 7.0
Safety
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
arXiv:2605.06076v1. Abstract: The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic…
cs.CL updates on arXiv.org
May 8
Score 7.0
Safety
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
arXiv:2605.06327v1. Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a…
cs.CL updates on arXiv.org
May 8
Score 7.0
Safety
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
arXiv:2605.05630v1. Abstract: Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single…
cs.CL updates on arXiv.org
May 8
Score 7.0
Agentic
A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents
This paper proposes a low-latency fraud-detection layer for spotting adversarial interaction patterns in LLM agents. It matters because agent defenses need to operate in real time, not just at the prompt-filtering stage.
cs.AI updates on arXiv.org
May 6
Score 9.9
Safety
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON combines SNOMED CT ontology grounding with machine learning to make clinical predictions more explainable. It is relevant for builders working on trustworthy medical AI, though the contribution appears narrower…
cs.AI updates on arXiv.org
May 6
Score 9.2
Safety
AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries
This paper reframes AI safety around irreversibility, arguing that low-friction deployment changes the control problem more than raw capability does. It should interest safety researchers looking for a systems-level…
cs.AI updates on arXiv.org
May 6
Score 8.9
Safety
MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration
This paper proposes a mediator-agent framework for human-vehicle collaboration that models both driver state and vehicle intent. It is relevant to safety work because it targets coordination failures caused by poor…
cs.AI updates on arXiv.org
May 6
Score 8.1
Agentic
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
This paper studies how a jailbreak can propagate across multi-agent systems and proposes a foresight-guided defense to stop the spread early. It matters for builders shipping agent swarms, where one compromised agent…
cs.AI updates on arXiv.org
May 6
Score 10.0
Agentic
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment
This position paper argues that multi-agent safety depends more on interaction topology than on the alignment or scale of the underlying models. For builders of agentic systems, it reframes safety as a systems-design…
cs.AI updates on arXiv.org
May 6
Score 9.3
GenAI
Understanding Emergent Misalignment via Feature Superposition Geometry
This paper offers a geometric explanation for emergent misalignment in fine-tuned LLMs, framing it as a feature-superposition problem rather than a mysterious safety failure. It should be useful for researchers studying…
cs.AI updates on arXiv.org
May 6
Score 10.0
Safety
This startup's new mechanistic interpretability tool lets you debug LLMs
A startup is pitching a mechanistic-interpretability tool for inspecting and steering LLM internals during training. If the claims hold up, it could give researchers a more direct way to debug model behavior and shape…
Artificial intelligence – MIT Technology Review
Apr 30
Score 7.0