arXiv cs.AI·19 May 2026

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

Signal

Hype

In three linesSPOT (Surgical Post-Training) is an on-policy distillation framework that injects reasoning capabilities into LLMs while preserving prior knowledge. With only 4k rectified math pairs, it improves Qwen3-8B by 6.2% on average in 16 minutes on 8x H800 GPUs. The approach uses KL-constrained reward formulation to mitigate catastrophic forgetting.

Read source

Your take?

Fine-tuning Reinforcement learning Reasoning Qwen

Summary generated by Claude — human-verified

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

Other angles on this story