arXiv cs.AI

Distilling LLM Feedback for Lean Theorem Proving

Signal: 75
Hype: 15
In three lines: A new post-training method for reasoning models, Feedback Distillation, trains the model to match, at the token level, its own output distribution conditioned on LLM-generated feedback. Tested on Lean 4 theorem proving, it maintains greater trajectory diversity than GRPO and improves policy entropy and pass@k scaling. Combining it with GRPO outperforms either method alone.
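To make the token-level matching idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the "teacher" is the same model scored with the feedback in its context and the "student" is the model without it. It also assumes the objective is a mean per-token KL divergence from teacher to student. The function names, the KL direction, and the averaging are all illustrative assumptions; the summary only says the model matches its feedback-conditioned distribution at the token level.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    Assumption: the teacher is the model conditioned on feedback and the
    student is the same model without feedback in its context.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def feedback_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a sequence (hypothetical objective).

    Each argument is a list of per-position logit vectors of equal length.
    """
    if len(teacher_seq) != len(student_seq) or not teacher_seq:
        raise ValueError("sequences must be non-empty and equal in length")
    total = sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))
    return total / len(teacher_seq)


# Toy example: two token positions over a three-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
student = [[1.0, 1.0, 0.0], [0.1, 0.2, 0.3]]
loss = feedback_distillation_loss(teacher, student)
```

In this toy example the second position contributes zero because the two distributions agree there, so only the first position drives the loss. In practice the teacher's distribution would be held fixed (no gradient) while the student is updated.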
Tags: Reasoning, Reinforcement learning, Fine-tuning, Papers

Summary generated by Claude — human-verified