NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

arXiv:2605.07051v1. Abstract: Large Language Models (LLMs) have performed well on various science education benchmarks, demonstrating their potential for use in science and mathematics education. Yet LLMs tend to be evaluated on science and mathematics education datasets from the Western world, while datasets from the Global South are underrepresented. Furthermore, these datasets tend to use multiple-choice answer options, which are trivial to evaluate. In this work, we…

cs.CL updates on arXiv.org · May 12
