arXiv cs.LG

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Signal: 72
Hype: 18
In three lines: ReMax formalizes exploration in RL through retries: a policy is evaluated by its expected maximum return over M samples. Exploration emerges naturally, without explicit bonuses. RePPO, a PPO variant that optimizes ReMax, generalizes the discrete M to a continuous parameter m for fine-grained control of exploration. Results are reported on the MinAtar and Craftax benchmarks.
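To make the "expected maximum return over M samples" idea concrete, here is a minimal sketch that scores a batch of returns by the expected maximum of m draws from their empirical distribution. The function name `expected_max_return` and the toy return values are hypothetical, and extending the formula to real-valued m by raising the empirical CDF to the power m is an assumption of this sketch; the paper's own continuous generalization and the RePPO update may differ.

```python
import numpy as np


def expected_max_return(returns, m):
    """Expected maximum of m draws (with replacement) from the empirical
    distribution of `returns`.

    For integer m this equals E[max(G_1, ..., G_m)] under the empirical
    distribution. Allowing real m > 0 is an assumption of this sketch,
    not necessarily the paper's formulation.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    x = np.sort(np.asarray(returns, dtype=float))
    n = x.size
    cdf = np.arange(n + 1) / n       # empirical CDF at 0/n, 1/n, ..., n/n
    weights = np.diff(cdf ** m)      # P(max of m draws is the i-th smallest)
    return float(np.dot(weights, x))


# Hypothetical toy returns: two policies with the same mean return.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [0.0, 0.0, 0.0, 4.0]

for m in (1, 1.5, 2, 4):
    print(m, expected_max_return(safe, m), expected_max_return(risky, m))
```

With m = 1 the score reduces to the mean, so both toy policies score 1.0. With m = 2 the high-variance policy scores 1.75 while the deterministic one stays at 1.0, which illustrates how a retry-based objective can favor exploratory behavior without an explicit bonus.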
Reinforcement learning, Benchmarks

Summary generated by Claude — human-verified