The Unlearnability Phenomenon in RLVR for Language Models
Signal: 75
Hype: 15
In three lines
A study reports an "unlearnability" phenomenon in Reinforcement Learning with Verifiable Reward (RLVR) for LLMs: some hard examples remain unlearnable even when correct rollouts are available. Cross-example gradient analysis points to fundamental representation flaws, namely low gradient similarity across examples and reasoning patterns that do not generalize. Data augmentation fails to improve gradient similarity.
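To make "gradient similarity" concrete, here is a minimal sketch that compares per-example gradients by cosine similarity. It is an illustration only, not the study's method: a two-weight logistic model stands in for the language model, the logistic loss stands in for the RLVR objective, and every name (`per_example_gradient`, `cosine_similarity`, the toy data) is hypothetical.

```python
import math


def per_example_gradient(weights, features, label):
    """Gradient of the logistic loss for one example w.r.t. the weights.

    Assumption: a toy logistic model replaces the LLM policy here.
    """
    z = sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - label) * x for x in features]


def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


weights = [0.0, 0.0]

# Examples A and B depend on the same feature; example C depends on a different one.
grad_a = per_example_gradient(weights, [1.0, 0.0], 1)
grad_b = per_example_gradient(weights, [2.0, 0.0], 1)
grad_c = per_example_gradient(weights, [0.0, 1.0], 1)

sim_ab = cosine_similarity(grad_a, grad_b)  # aligned gradients
sim_ac = cosine_similarity(grad_a, grad_c)  # orthogonal gradients
print(sim_ab, sim_ac)
```

In this toy setting, a step that helps example A also helps B, because their gradients point the same way, but it does nothing for C. Low similarity between a hard example's gradient and those of the rest of the training set would likewise mean that learning elsewhere transfers little to it.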
Summary generated by Claude — human-verified