Evaluation Awareness in Language Models Has Limited Effect on Behaviour

arXiv:2605.05835v1. Abstract: Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight…

cs.CL updates on arXiv.org · May 8
