Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
Signal
78
Hype
15
In three lines
PROF, a data curation method, combines Process Reward Models (PRM) and outcome rewards (ORM) to improve reinforcement learning on reasoning tasks. It filters training samples by keeping correct responses with strong process support and incorrect responses with weak process support, avoiding instability from direct PRM optimization.
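The filtering rule in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the `Rollout` type, the `consistency_filter` function, the use of the mean step score as the measure of process support, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    """One sampled response (hypothetical container, not from the paper)."""
    text: str
    is_correct: bool          # outcome reward: did the final answer check out?
    step_scores: list[float]  # per-step PRM scores, assumed to lie in [0, 1]


def process_score(rollout: Rollout) -> float:
    # Assumption: process support is summarized as the mean PRM step score.
    return mean(rollout.step_scores)


def consistency_filter(rollouts: list[Rollout], keep_fraction: float = 0.5) -> list[Rollout]:
    """Keep samples whose process score agrees with their outcome.

    Correct responses are kept when their process score is high;
    incorrect responses are kept when their process score is low.
    """
    # Correct responses, strongest process support first.
    correct = sorted(
        (r for r in rollouts if r.is_correct), key=process_score, reverse=True
    )
    # Incorrect responses, weakest process support first.
    incorrect = sorted(
        (r for r in rollouts if not r.is_correct), key=process_score
    )

    def head(group: list[Rollout]) -> list[Rollout]:
        # Keep the leading fraction of a group, or nothing if it is empty.
        if not group:
            return []
        return group[: math.ceil(len(group) * keep_fraction)]

    return head(correct) + head(incorrect)


rollouts = [
    Rollout("a", True, [0.9, 0.9]),   # correct, strong process support
    Rollout("b", True, [0.2, 0.2]),   # correct, weak process support
    Rollout("c", False, [0.8, 0.8]),  # incorrect, strong process support
    Rollout("d", False, [0.1, 0.1]),  # incorrect, weak process support
]
kept = consistency_filter(rollouts)
```

In this sketch the PRM scores only decide which samples reach the policy update; the reward used for training remains the outcome reward, which is one way to read the summary's point about avoiding direct PRM optimization.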
Summary generated by Claude — human-verified