Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
Signal
72
Hype
18
In three lines
ReMax formalizes exploration in RL through retries: a policy is evaluated by its expected maximum return over M samples, so exploration emerges without explicit bonuses. RePPO, a PPO variant that optimizes the ReMax objective, generalizes the discrete retry count M to a continuous parameter m for fine-grained control over exploration. Results are reported on the MinAtar and Craftax benchmarks.
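To make the retry objective concrete, here is a minimal sketch of how an "expected maximum over M samples" score could be estimated from a batch of observed returns. It is not the RePPO algorithm. The function name `remax_estimate` and the two toy return batches are hypothetical. The extension to non-integer m, which raises the empirical CDF to the power m, is an assumption chosen for illustration and may differ from the paper's definition.

```python
def remax_estimate(returns, m):
    """Plug-in estimate of the expected maximum of m i.i.d. draws.

    Uses the empirical distribution of `returns`: the maximum of m draws
    has CDF F(x)**m, so the i-th smallest of n returns gets weight
    (i/n)**m - ((i-1)/n)**m. Allowing real-valued m > 0 here is an
    illustrative assumption, not necessarily the paper's construction.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    ordered = sorted(float(r) for r in returns)
    n = len(ordered)
    if n == 0:
        raise ValueError("returns must be non-empty")
    total = 0.0
    for i, r in enumerate(ordered, start=1):
        weight = (i / n) ** m - ((i - 1) / n) ** m
        total += weight * r
    return total


# Hypothetical toy data: a "safe" policy that always returns 1.0, and a
# "risky" policy that returns 3.0 in 2 of 10 episodes and 0.0 otherwise.
safe_returns = [1.0] * 10
risky_returns = [0.0] * 8 + [3.0] * 2

for m in (1, 1.5, 2, 4):
    safe = remax_estimate(safe_returns, m)
    risky = remax_estimate(risky_returns, m)
    print(f"m={m}: safe={safe:.3f} risky={risky:.3f}")
```

At m = 1 the estimate reduces to the ordinary mean, so the safe policy scores higher (1.0 versus 0.6). At m = 2 the risky policy already scores about 1.08, because a single success among the retries is enough. This is the sense in which the objective rewards exploratory, high-variance behavior without a separate bonus term.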
Summary generated by Claude — human-verified