arXiv cs.AI

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

Signal: 78 · Hype: 15
In three lines: PROF is a data curation method that combines Process Reward Models (PRMs) with Outcome Reward Models (ORMs) to improve reinforcement learning on reasoning tasks. It filters training samples, keeping correct responses with strong process support and incorrect responses with weak process support. Because the PRM is used only to select samples, the method avoids the instability of optimizing against PRM scores directly.
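The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Rollout` type, the use of the mean step score as the measure of "process support", the binary correctness flag, and the `keep_fraction` parameter are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Rollout:
    """One sampled response (hypothetical structure for illustration)."""
    text: str
    is_correct: bool          # outcome reward, assumed binary
    step_scores: List[float]  # per-step PRM scores, assumed non-empty


def process_score(rollout: Rollout) -> float:
    # Assumption: "process support" is summarized by the mean step score.
    return mean(rollout.step_scores)


def filter_rollouts(rollouts: List[Rollout], keep_fraction: float = 0.5) -> List[Rollout]:
    """Keep correct rollouts with the highest process scores and
    incorrect rollouts with the lowest process scores."""
    correct = sorted(
        (r for r in rollouts if r.is_correct), key=process_score, reverse=True
    )
    incorrect = sorted(
        (r for r in rollouts if not r.is_correct), key=process_score
    )
    # Round up so a non-empty group always contributes at least one sample.
    n_correct = math.ceil(len(correct) * keep_fraction)
    n_incorrect = math.ceil(len(incorrect) * keep_fraction)
    return correct[:n_correct] + incorrect[:n_incorrect]
```

In this sketch the PRM scores never enter the training objective; they only decide which samples are kept, so the outcome reward remains the sole signal the policy is optimized against.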
Reinforcement learning · Reasoning · Evals

Summary generated by Claude — human-verified