CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
Signal: 72
Hype: 18
In three lines
CAST is an answer-free self-distillation method for GRPO (Group Relative Policy Optimization). It uses a stop-gradient self-teacher to shape token-level advantages by trajectory correctness, with bidirectional advantage sign reversal and bounded advantages for zero-variance groups. It improves mathematical reasoning performance.
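For context on the zero-variance case mentioned above, here is a minimal sketch of the standard GRPO group-relative advantage, not CAST's own formulation, which this summary does not spell out. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (baseline GRPO-style).

    Hypothetical helper: eps and the population standard deviation are
    illustrative choices, not taken from the CAST paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every trajectory gets the same reward has zero variance,
# so every advantage is zero and the group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this baseline, all-correct or all-incorrect groups are effectively discarded, which is presumably the gap that a bounded advantage for zero-variance groups is meant to fill.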
Summary generated by Claude — human-verified