arXiv cs.LG

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Signal: 72
Hype: 18
In three lines: FiRe-OPD introduces fine-grained on-policy distillation combining trajectory filtering and soft token reweighting. Validated on AIME 2024 (+6.25 strong-to-weak) and Miner (+18.81 multi-teacher), the method outperforms recent token-level OPD approaches in stability and performance.
Reinforcement learning · Fine-tuning · Papers

Summary generated by Claude — human-verified