Search

Every story we've curated, in one place. Type a phrase, a tool name, or a researcher. Put a phrase in quotes to match it exactly; prefix a term with - to exclude it.

22 matches for “interpretability”

Safety
This startup's new mechanistic interpretability tool lets you debug LLMs
A startup is pitching a mechanistic-interpretability tool for inspecting and steering LLM internals during training. If the claims hold up, it could give researchers a more direct way to debug model behavior and shape…
Artificial intelligence – MIT Technology Review, Apr 30
Score 7.0
Safety
Understanding Annotator Safety Policy with Interpretability
arXiv:2605.05329v1. Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can…
cs.AI updates on arXiv.org, May 9
Score 7.0
Agentic
Beyond the Black Box: Interpretability of Agentic AI Tool Use
arXiv:2605.06890v1. AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control.…
cs.AI updates on arXiv.org, May 11
Score 7.0
Safety
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON combines SNOMED CT ontology grounding with machine learning to make clinical predictions more explainable. It is relevant for builders working on trustworthy medical AI, though the contribution appears narrower…
cs.AI updates on arXiv.org, May 6
Score 9.2
GenAI
Qwen-Scope: Decoding Intelligence, Unleashing Potential (9 minute read)
Qwen-Scope is an interpretability toolkit for Qwen3 and Qwen3.5 that exposes internal model behavior for analysis and control. It may be useful for debugging, controllable inference, and dataset inspection, though the…
TLDR AI Feed, May 1
Score 6.2
Safety
The Sequence AI of the Week #859: Reading Claude's Mind in English: A Note on Natural Language Autoencoders
Anthropic's fascinating new papers for the future of AI interpretability.
TheSequence, May 13
Score 7.0
GenAI
Oversmoothing as Representation Degeneracy in Neural Sheaf Diffusion
This paper reframes oversmoothing in neural sheaf diffusion as a representation-degeneracy problem and brings quiver/sheaf theory to bear on the dynamics. It is mathematically rich, but the practical payoff for GenAI…
cs.LG updates on arXiv.org, May 13
Score 7.3
Agentic
Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers
arXiv:2605.05725v1. Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly…
cs.AI updates on arXiv.org, May 9
Score 7.0
GenAI
A Foundation Model for Zero-Shot Logical Rule Induction
arXiv:2605.04916v1. Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific…
cs.AI updates on arXiv.org, May 8
Score 7.0
Safety
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
arXiv:2605.05653v1. Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic…
cs.CL updates on arXiv.org, May 8
Score 7.0
Safety
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
arXiv:2605.05950v1. The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a…
cs.CL updates on arXiv.org, May 8
Score 7.0
Safety
MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series
arXiv:2605.05524v1. Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization…
cs.LG updates on arXiv.org, May 8
Score 7.0
Safety
Data-Driven Variational Basis Learning Beyond Neural Networks: A Non-Neural Framework for Adaptive Basis Discovery
arXiv:2605.05221v1. Classical representation systems such as Fourier series, wavelets, and fixed dictionaries provide analytically tractable basis expansions, but they are not intrinsically…
cs.LG updates on arXiv.org, May 8
Score 7.0
Industry
Expert Routing for Communication-Efficient MoE via Finite Expert Banks
arXiv:2605.05278v1. Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing…
cs.LG updates on arXiv.org, May 8
Score 7.0
Safety
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
arXiv:2605.06076v1. The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic…
cs.CL updates on arXiv.org, May 8
Score 7.0
GenAI
Gyan: An Explainable Neuro-Symbolic Language Model
arXiv:2605.04759v1. Transformer based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large scale pre-training, these models…
cs.CL updates on arXiv.org, May 7
Score 7.0
Industry
A Dirac-Frenkel-Onsager principle: Instantaneous residual minimization with gauge momentum for nonlinear parametrizations of PDE solutions
arXiv:2605.00284v1. Dirac-Frenkel instantaneous residual minimization evolves nonlinear parametrizations of PDE solutions in time, but ill-conditioning can render the parameter dynamics…
cs.LG updates on arXiv.org, May 4
Industry
OTSS: Output-Targeted Soft Segmentation for Contextual Decision-Weight Learning
arXiv:2605.00193v1. Many machine learning systems make constrained decisions by optimizing factorized objectives, but the context-specific objective is often treated as fixed. We study…
cs.LG updates on arXiv.org, May 4
Industry
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization
arXiv:2605.00130v1. Learning meaningful representations from medical time series (MedTS) such as ECG or EEG signals is a critical challenge. These signals are often high-dimensional,…
cs.LG updates on arXiv.org, May 4
Industry
Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue
arXiv:2605.00506v1. We model utterance production as probabilistic cost-sensitive choice over contextual alternatives, using information-theoretic notions of cost. We distinguish between…
cs.CL updates on arXiv.org, May 4
Industry
Language-free Experience at Expo 2025 Osaka
arXiv:2605.00373v1. In line with the Global Communication Plan 2025, we have pursued the development of multilingual translation technologies to realize a language-barrier-free experience at…
cs.CL updates on arXiv.org, May 4
Safety
What Physics do Data-Driven MoCap-to-Radar Models Learn?
arXiv:2605.00018v1. Data-driven MoCap-to-radar models generate plausible micro-Doppler spectrograms, but do they actually learn the underlying physics? We introduce a physics-based…
cs.LG updates on arXiv.org, May 4