arXiv cs.LG·3 June 2026

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

In three lines: FiRe-OPD introduces fine-grained on-policy distillation combining trajectory filtering and soft token reweighting. Validated on AIME 2024 (+6.25 strong-to-weak) and Miner (+18.81 multi-teacher), the method outperforms recent token-level OPD approaches in stability and performance.

Reinforcement learning Fine-tuning Papers

Summary generated by Claude — human-verified
