arXiv cs.AI·1 June 2026

Distilling LLM Feedback for Lean Theorem Proving

In three lines: Feedback Distillation is a new post-training method for reasoning models. It trains the model to match, at the token level, its own output distribution conditioned on LLM-generated feedback. Tested on Lean 4 theorem proving, it maintains greater trajectory diversity than GRPO and improves policy entropy and pass@k scaling. Combined with GRPO, it outperforms either method alone.
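As a rough illustration of what "matching a feedback-conditioned distribution at token level" could look like, the sketch below computes a per-token KL divergence between two sets of next-token logits. It is not taken from the paper. The KL direction, the averaging over positions, and the names `token_level_kl`, `teacher_logits`, and `student_logits` are assumptions made for this example, and the toy logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Assumption for this sketch: the "teacher" is the same model run with the
    feedback in its context and treated as a fixed target, while the "student"
    is the model run without the feedback.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("both sequences must cover the same token positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # feedback-conditioned distribution
        q = softmax(s_row)  # unconditioned distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)


# Toy logits: two token positions over a three-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[0.5, 2.0, 0.1], [0.3, 0.3, 0.3]]

loss = token_level_kl(teacher, student)
```

In this toy case only the first position contributes to `loss`, because the two distributions agree at the second position. A real training loop would compute the same quantity on model logits with an autodiff framework and backpropagate through the student side only.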

Tags: Reasoning, Reinforcement learning, Fine-tuning, Papers

Summary generated by Claude — human-verified
