Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Signal: 75 | Hype: 25
In three lines: DMPO (Distribution-Matching Policy Optimization) solves mode collapse in on-policy RL methods like GRPO by approximating forward KL instead of reverse KL. On text and vision NP-Bench, DMPO achieves 43.9% and 43.1% Quality Ratio (vs 40.1% and 38.4% for GRPO), with +2.0% gains on mathematical reasoning.
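To make the forward-versus-reverse KL distinction concrete, here is a minimal sketch on a toy discrete distribution. It is not DMPO's objective or training code. The `target`, `collapsed`, and `diverse` distributions are invented for illustration. It only shows the standard property that forward KL, KL(target || policy), penalizes a policy for dropping a mode more heavily than reverse KL, KL(policy || target), does.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical target over four candidate solutions with two equally good modes.
target = [0.48, 0.02, 0.02, 0.48]
# A collapsed policy that puts nearly all of its mass on a single mode.
collapsed = [0.97, 0.01, 0.01, 0.01]
# A diverse policy that covers both modes.
diverse = [0.45, 0.05, 0.05, 0.45]

forward_collapsed = kl(target, collapsed)  # KL(target || policy)
reverse_collapsed = kl(collapsed, target)  # KL(policy || target)
forward_diverse = kl(target, diverse)
reverse_diverse = kl(diverse, target)

print(f"collapsed: forward={forward_collapsed:.3f} reverse={reverse_collapsed:.3f}")
print(f"diverse:   forward={forward_diverse:.3f} reverse={reverse_diverse:.3f}")
```

On this toy example the collapsed policy scores noticeably worse under forward KL than under reverse KL, while the diverse policy scores low under both. That is the usual intuition for why a forward-KL-style objective discourages collapsing onto one mode.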
Summary generated by Claude — human-verified