GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published Jul 1, 2025 • 248
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1, 2025 • 46
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13, 2025 • 24
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner Paper • 2506.09003 • Published Jun 10, 2025 • 18
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy Paper • 2506.13284 • Published Jun 16, 2025 • 26
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Paper • 2506.01939 • Published Jun 2, 2025 • 187
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Paper • 2506.13585 • Published Jun 16, 2025 • 273
Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure Paper • 2506.12278 • Published Jun 13, 2025 • 16
Table-R1: Inference-Time Scaling for Table Reasoning Paper • 2505.23621 • Published May 29, 2025 • 93
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84