This paper proposes Segment-Aligned Policy Optimization (SAPO), a reinforcement-learning formulation that aligns training updates with reasoning steps rather than with individual tokens or entire responses. This could improve credit assignment and training stability for multi-modal reasoning models.
arXiv:2605.01327v1 Announce Type: new Abstract: Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a…
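The truncated abstract does not describe SAPO's actual algorithm, but the core idea it contrasts against — token-level vs. response-level credit assignment — can be illustrated. The sketch below shows a generic segment-level scheme: one scalar reward per reasoning segment is broadcast to that segment's tokens before forming a policy-gradient surrogate. All names (`segment_advantages`, the mean-reward baseline, the segment boundary format) are assumptions for illustration, not SAPO's definitions.

```python
# Hypothetical sketch of segment-level credit assignment; not SAPO's actual
# formulation (the abstract is truncated). Names and shapes are assumptions.
import numpy as np

def segment_advantages(token_logprobs, segment_bounds, segment_rewards):
    """Broadcast one advantage per reasoning segment to its tokens.

    token_logprobs:  per-token log-probabilities under the current policy
    segment_bounds:  list of (start, end) token index pairs, one per segment
    segment_rewards: one scalar reward per segment
    Returns a scalar policy-gradient surrogate loss.
    """
    logp = np.asarray(token_logprobs, dtype=float)
    adv = np.zeros_like(logp)
    # Simple baseline: mean segment reward (an illustrative choice only).
    baseline = float(np.mean(segment_rewards))
    for (start, end), r in zip(segment_bounds, segment_rewards):
        # Every token in a segment shares that segment's advantage,
        # unlike per-token or whole-response credit assignment.
        adv[start:end] = r - baseline
    # REINFORCE-style surrogate: minimize -sum(logprob * advantage).
    return -(logp * adv).sum()

# Toy example: 4 tokens split into two 2-token segments.
loss = segment_advantages(
    token_logprobs=[-0.1, -0.2, -0.3, -0.4],
    segment_bounds=[(0, 2), (2, 4)],
    segment_rewards=[1.0, 0.0],
)
```

With these toy numbers the baseline is 0.5, the first segment's tokens get advantage +0.5 and the second's -0.5, yielding a loss of roughly -0.2. Under a token-level scheme each token would instead need its own advantage estimate; under a response-level scheme all four tokens would share one.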