arXiv cs.AI

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Signal: 72, Hype: 18
In three lines: CAST is an answer-free self-distillation method for GRPO (Group Relative Policy Optimization). It uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness, with bidirectional advantage sign reversal and bounded advantages for zero-variance groups. The authors report improved mathematical reasoning performance.
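For context, here is a minimal sketch of the standard GRPO group-relative advantage. It is not CAST itself, whose details the summary does not give. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration. The sketch shows why a zero-variance group (every sampled answer correct, or every one wrong) yields all-zero advantages and therefore no learning signal. This is presumably the case that CAST's bounded advantages are meant to address.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Hypothetical helper for illustration; `eps` guards against division
    by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct answers get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Zero-variance group: every advantage collapses to zero.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the mixed group the advantages are close to +1 and -1. In the all-correct group they are exactly zero, so plain GRPO gets no gradient from that prompt.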
Reinforcement learning, Reasoning, Code generation, Papers

Summary generated by Claude — human-verified