arXiv cs.AI · 20 May 2026

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

In three lines: DMPO (Distribution-Matching Policy Optimization) solves mode collapse in on-policy RL methods such as GRPO by approximating the forward KL divergence instead of the reverse KL divergence. On text and vision NP-Bench, DMPO achieves Quality Ratios of 43.9% and 43.1% respectively (vs. 40.1% and 38.4% for GRPO), with a +2.0% gain on mathematical reasoning.
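To make the forward-versus-reverse KL contrast concrete, here is a minimal toy sketch. It is not DMPO's objective or training loop, and the three-outcome distributions (`target`, `collapsed`, `spread`) are made-up numbers chosen only for illustration. It shows the general property the summary relies on: reverse KL tends to favour a policy that concentrates on one mode, while forward KL tends to favour one that covers all modes.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical target with two equally good modes (outcomes 0 and 2),
# e.g. two distinct but valid solution strategies.
target = [0.49, 0.02, 0.49]

# Two hypothetical candidate policies.
collapsed = [0.96, 0.02, 0.02]  # nearly all mass on a single mode
spread = [0.34, 0.32, 0.34]     # covers both modes, wastes mass in between

for name, policy in [("collapsed", collapsed), ("spread", spread)]:
    reverse_kl = kl(policy, target)  # KL(policy || target): mode-seeking
    forward_kl = kl(target, policy)  # KL(target || policy): mode-covering
    print(f"{name:>9}: reverse KL = {reverse_kl:.3f}, forward KL = {forward_kl:.3f}")
```

With these numbers, reverse KL scores the collapsed policy as closer to the target, whereas forward KL scores the spread policy as closer, because forward KL heavily penalises assigning low probability to any outcome the target considers likely.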

Reinforcement learning · Reasoning · Benchmarks

Summary generated by Claude — human-verified
