arXiv cs.AI

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Signal: 75 · Hype: 25
In three lines: DMPO (Distribution-Matching Policy Optimization) addresses mode collapse in on-policy RL methods such as GRPO by approximating the forward KL divergence rather than the reverse KL. On text and vision NP-Bench, DMPO achieves Quality Ratios of 43.9% and 43.1%, respectively (vs. 40.1% and 38.4% for GRPO), with +2.0% gains on mathematical reasoning.
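To make the forward-versus-reverse KL distinction concrete, here is a minimal sketch on a toy two-outcome distribution. It is not DMPO or GRPO; the `target` distribution, the `collapsed_policy` helper, and the `eps` values are all hypothetical and chosen only for illustration. It shows that when a policy drops one of two equally good modes, the reverse KL stays bounded (at most ln 2 for this uniform target) while the forward KL grows without limit.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions; b must be strictly positive."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical target: two equally good solution modes.
target = [0.5, 0.5]


def collapsed_policy(eps):
    """A policy that puts almost all of its mass on the first mode."""
    return [1.0 - eps, eps]


for eps in (1e-1, 1e-3, 1e-6):
    q = collapsed_policy(eps)
    forward = kl(target, q)   # KL(target || policy): penalizes missing modes
    reverse = kl(q, target)   # KL(policy || target): tolerates missing modes
    print(f"eps={eps:g}  forward KL={forward:.3f}  reverse KL={reverse:.3f}")
```

As `eps` shrinks, the printed forward KL keeps rising while the reverse KL levels off, which is the usual intuition for why a reverse-KL objective can tolerate collapse onto a single mode and a forward-KL objective cannot.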
Reinforcement learning · Reasoning · Benchmarks

Summary generated by Claude — human-verified