GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

arXiv:2605.07053v1 · Announce Type: new

Abstract: Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework…

cs.CL updates on arXiv.org · May 12

From the source