GR-Ben is a benchmark for evaluating process reward models beyond math-heavy tasks, targeting errors in the intermediate reasoning and decision-making steps of LLMs. It matters for teams building test-time scaling and verifier-style systems that need broader, more realistic process supervision.
arXiv:2605.01203v1 Announce Type: new Abstract: Process reward models (PRMs) have shown remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps across a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks focus primarily on mathematical reasoning, thereby failing…
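To make the role of a PRM in test-time scaling concrete, here is a minimal, hypothetical sketch of best-of-N selection: each candidate solution is a list of reasoning steps, each step gets a reward from the PRM, and the candidate whose weakest step scores highest is selected. The `toy_prm_score` function is a stand-in for a real learned model and is not from the paper; the min-aggregation choice is one common convention, not the paper's method.

```python
# Hypothetical sketch of PRM-guided best-of-N selection.
# toy_prm_score is a placeholder for a real process reward model.

def toy_prm_score(step: str) -> float:
    """Toy stand-in for a PRM: low reward for steps flagged as erroneous."""
    return 0.1 if "error" in step else 0.9

def solution_score(steps: list[str]) -> float:
    """Aggregate step rewards; min-aggregation surfaces the weakest step."""
    return min(toy_prm_score(s) for s in steps)

def best_of_n(candidates: list[list[str]]) -> list[str]:
    """Pick the candidate whose weakest step has the highest reward."""
    return max(candidates, key=solution_score)

candidates = [
    ["parse the question", "error in algebra", "state answer"],
    ["parse the question", "apply correct rule", "state answer"],
]
print(best_of_n(candidates))  # the candidate without the flawed step
```

A benchmark like GR-Ben would then test whether the scoring model can actually distinguish flawed from sound steps outside mathematical domains.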