Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
Signal: 72
Hype: 18
In three lines: FiRe-OPD introduces fine-grained on-policy distillation combining trajectory filtering and soft token reweighting. Validated on AIME 2024 (+6.25 strong-to-weak) and Miner (+18.81 multi-teacher), the method outperforms recent token-level OPD approaches in stability and performance.
Summary generated by Claude — human-verified