Reinforcement learning on realistic scenarios targeting beneficial traits can produce broad improvements across dozens of benchmarks measuring aligned and beneficial behavior. These gains generalize beyond the domains used for training and persist under adversarial pressure. This suggests that personas could be deeply entrenched in models, and RL may be a path towards entrenching beneficial personas.