arXiv cs.CL

Reinforcement Learning for LLM Post-Training: A Survey

Signal
82
Hype
15
In three lines
A comprehensive survey of reinforcement learning post-training methods for LLMs. It unifies RLHF (DPO), RLVR (PPO, GRPO), and SFT within a single policy gradient framework. It also gives a detailed technical analysis of offline and iterative approaches, using standardized notation so the methods can be compared directly.
Reinforcement learning · Alignment · Papers

Summary generated by Claude — human-verified