arXiv:2605.06895v1 Announce Type: new Abstract: How can we make models robust even to imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in…
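For reference, a standard form of the Boltzmann (Bradley-Terry) preference model the abstract alludes to, written here as a sketch; the symbols r (latent reward), beta (inverse temperature), and sigma (the logistic function) follow common convention and are not necessarily the paper's exact notation:

\[
P(y_1 \succ y_2 \mid x)
= \frac{\exp\bigl(\beta\, r(x, y_1)\bigr)}{\exp\bigl(\beta\, r(x, y_1)\bigr) + \exp\bigl(\beta\, r(x, y_2)\bigr)}
= \sigma\bigl(\beta\,[\,r(x, y_1) - r(x, y_2)\,]\bigr)
\]

Under this link function, preferences become nearly deterministic as the reward gap grows, so a misspecified or miscalibrated link can make the learned reward model fragile to noisy or imperfect human feedback, the failure mode the abstract's opening question targets.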