Distilling LLM Feedback for Lean Theorem Proving
Signal: 75
Hype: 15
In three lines:
New post-training method for reasoning models: Feedback Distillation trains the model, at the token level, to match its own output distribution conditioned on LLM-generated feedback.
Tested on Lean 4 theorem proving, it maintains greater trajectory diversity than GRPO and improves both policy entropy and pass@k scaling.
Combined with GRPO, it outperforms either method alone.
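
To make the token-level matching idea concrete, here is a minimal sketch of what such an objective could look like. It is not the paper's implementation. It assumes the target ("teacher") is the same model with feedback in its context, the "student" is the model without feedback, and the loss is the mean per-token KL(teacher || student). The KL direction, the function names, and the toy logits are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def feedback_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one trajectory.

    teacher_seq: per-token logits from the model WITH feedback in context
                 (assumed setup, hypothetical naming).
    student_seq: per-token logits from the model WITHOUT feedback.
    """
    if len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must have the same number of tokens")
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)


# Toy example: a 2-token trajectory over a 3-word vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[0.5, 2.0, 0.1], [0.3, 0.3, 0.3]]  # feedback shifts token 1 only
loss = feedback_distillation_loss(teacher, student)
print(f"loss = {loss:.4f}")
```

In this sketch, only positions where the feedback changes the model's prediction contribute to the loss, which gives a denser per-token signal than a single scalar reward for the whole trajectory.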
Summary generated by Claude — human-verified