OpenAI introduced GPT-5.6 Preview, a family of models named Sol, Terra, and Luna, with Sol positioned as the flagship model. The system card describes stronger cyber and bio safety testing, new safeguards, and a limited…
OpenAI introduced LifeSciBench, an expert-judged benchmark that evaluates AI systems on end-to-end life sciences workflows such as evidence analysis, experimental design, scientific reasoning, and research communication…
This was my last week at the Allen Institute for AI (Ai2), where I had the great privilege of working on the Olmo models, growing, learning, and having a broad, lasting impact.
Anthropic outlines two possible 2028 global AI leadership scenarios: one where the US retains its compute advantage and shapes AI norms, and another where China competes closely due to policy inaction. The US currently…
This paper proposes Variational Linear Attention, an online least-squares formulation that stabilizes linear attention memory with an adaptive penalty matrix. It targets a core bottleneck in long-context transformers:…
This paper reframes oversmoothing in neural sheaf diffusion as a representation-degeneracy problem and brings quiver/sheaf theory to bear on the dynamics. It is mathematically rich, but the practical payoff for GenAI…
ASD-Bench introduces a four-axis benchmark for autism spectrum disorder screening across children, adolescents, and adults. It gives researchers a more structured way to compare classical ML, deep learning, and…
This paper tightens the evaluation of diffusion-based OOD detectors by controlling for backbone choice and test-time budget, then proposes sparse internal feature snapshots as a fairer detector family. It matters most…
A reflective analysis of how open model ecosystems can reinforce themselves through participation, iteration, and distribution. The piece is most useful as a strategic read on why open-first AI communities can compound…
State space models are moving from a niche alternative to a credible transformer competitor, with tradeoffs that matter for long-context efficiency and scaling. The piece is a useful snapshot of where SSMs fit, and…
arXiv:2605.08111v1: The widespread availability of complex time series data in various domains such as environmental science, epidemiology, and economics demands robust causal discovery…
arXiv:2605.08200v1: A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region…
arXiv:2605.08545v1: Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by…
arXiv:2605.08816v1: In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional…
arXiv:2605.08220v1: The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise,…
arXiv:2605.08202v1: Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by…
arXiv:2605.08445v1: AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and…
arXiv:2605.08448v1: Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we…
arXiv:2605.08144v1: Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly…
arXiv:2605.08368v1: Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction…
arXiv:2605.08138v1: Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource…
arXiv:2605.08174v1: To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on…
arXiv:2605.08197v1: Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300-item benchmark for executable causal mechanism induction…
arXiv:2605.08354v1: Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment.…
arXiv:2605.08388v1: Human-AI teams play a pivotal role in improving overall system performance when neither the human nor the model can achieve such performance on their own. With the advent…
arXiv:2605.08703v1: Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference…
arXiv:2605.08538v1: Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture…
arXiv:2605.08614v1: Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective…
arXiv:2605.08113v1: Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks…
arXiv:2605.08533v1: Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in…
arXiv:2605.08177v1: Parameter-efficient fine-tuning (PEFT) has become a practical route for adapting large language models to downstream tasks, with LoRA-style methods being particularly…
arXiv:2605.08776v1: Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high…