MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization Paper • 2605.10784 • Published 11 days ago • 1
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning Paper • 2605.02913 • Published Apr 8 • 9