COOPO: Cyclic Offline-Online Policy Optimization Algorithm
Signal: 72 | Hype: 28
In three lines: COOPO is a hybrid offline-online reinforcement learning algorithm that cycles between KL-regularized offline training and online fine-tuning. Periodic returns to offline training eliminate catastrophic forgetting and distribution drift. On D4RL benchmarks, COOPO reduces online interactions while improving final returns compared to state-of-the-art hybrids.
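To make the cycling idea concrete, here is a minimal toy sketch on a multi-armed bandit. It is not the COOPO implementation: the summary does not give the algorithm's details, so every name here (`offline_logits`, `online_finetune`, `run_cycles`), the bandit setting, the REINFORCE update, and all hyperparameters are illustrative assumptions. The offline phase uses the standard closed-form maximizer of a KL-regularized objective, pi(a) proportional to behavior(a) * exp(q(a) / beta), and the online phase fine-tunes the same logits with sampled interactions before the loop returns to offline training on the enlarged dataset.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def offline_logits(data, n_actions, beta):
    """KL-regularized offline step (assumed form, not taken from the paper).

    Returns logits whose softmax maximizes
    sum_a pi(a) * q(a) - beta * KL(pi || behavior),
    i.e. pi(a) proportional to behavior(a) * exp(q(a) / beta).
    """
    counts = [0] * n_actions
    sums = [0.0] * n_actions
    for action, reward in data:
        counts[action] += 1
        sums[action] += reward
    # Mean observed reward per action; unseen actions default to 0.
    q = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    # Add-one smoothing keeps the behavior estimate strictly positive.
    behavior = [(c + 1) / (len(data) + n_actions) for c in counts]
    return [math.log(b) + qa / beta for b, qa in zip(behavior, q)]


def online_finetune(logits, data, true_means, steps, lr, rng):
    """Online phase: REINFORCE updates on the logits; new samples join the dataset."""
    logits = list(logits)
    for _ in range(steps):
        pi = softmax(logits)
        action = rng.choices(range(len(pi)), weights=pi)[0]
        reward = true_means[action] + rng.gauss(0.0, 0.1)
        # Baseline: mean reward of the data collected so far.
        baseline = sum(r for _, r in data) / len(data) if data else 0.0
        data.append((action, reward))
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. logit j is 1[j == action] - pi[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == action else 0.0
            logits[j] += lr * advantage * (indicator - pi[j])
    return logits


def run_cycles(true_means, offline_data, cycles=3, online_steps=50,
               beta=0.5, lr=0.1, seed=0):
    """Alternate offline and online phases; returns the final policy and a phase log."""
    rng = random.Random(seed)
    data = list(offline_data)
    n_actions = len(true_means)
    phases = []
    logits = [0.0] * n_actions
    for _ in range(cycles):
        logits = offline_logits(data, n_actions, beta)  # return to offline training
        phases.append("offline")
        logits = online_finetune(logits, data, true_means, online_steps, lr, rng)
        phases.append("online")
    return softmax(logits), phases
```

A call such as `run_cycles([0.2, 0.8, 0.5], [(0, 0.2), (2, 0.5)])` returns a probability vector over the three arms and the phase log. In this sketch, each return to the offline phase re-anchors the policy to everything collected so far, which is one plausible reading of how periodic offline training could counter drift.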
Summary generated by Claude — human-verified