GR-Ben is a benchmark for evaluating process reward models beyond math-heavy tasks, targeting errors in the intermediate reasoning and decision-making steps of LLMs. It matters for teams building test-time scaling and verifier-style systems that need broader, more realistic process supervision.
arXiv:2605.01203v1 Announce Type: new Abstract: Process reward models (PRMs) have shown remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps across a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks focus primarily on mathematical reasoning, thereby failing…
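To make the role of a PRM in test-time scaling concrete, here is a minimal, hypothetical sketch of best-of-N selection: each candidate solution is a list of reasoning steps, each step gets a reward from the PRM, and the candidate whose weakest step scores highest is selected. The `toy_prm_score` function is a stand-in for a real learned model and is not from the paper; the min-aggregation choice is one common convention, not the paper's method.

```python
# Hypothetical sketch of PRM-guided best-of-N selection.
# toy_prm_score is a placeholder for a real process reward model.

def toy_prm_score(step: str) -> float:
    """Toy stand-in for a PRM: low reward for steps flagged as erroneous."""
    return 0.1 if "error" in step else 0.9

def solution_score(steps: list[str]) -> float:
    """Aggregate step rewards; min-aggregation surfaces the weakest step."""
    return min(toy_prm_score(s) for s in steps)

def best_of_n(candidates: list[list[str]]) -> list[str]:
    """Pick the candidate whose weakest step has the highest reward."""
    return max(candidates, key=solution_score)

candidates = [
    ["parse the question", "error in algebra", "state answer"],
    ["parse the question", "apply correct rule", "state answer"],
]
print(best_of_n(candidates))  # the candidate without the flawed step
```

A benchmark like GR-Ben would then test whether the scoring model can actually distinguish flawed from sound steps outside mathematical domains.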