arXiv cs.CL

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Signal: 78 · Hype: 15
In three lines: Mu-GRPO improves GRPO efficiency by tolerating higher rollout staleness. The framework organizes training into four sequential generation-optimization stages, reducing system overhead by 2x while maintaining performance on math reasoning benchmarks.
Reinforcement learning · Reasoning · Benchmarks

Summary generated by Claude — human-verified