Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]
Signal
35
Hype
15
In three linesDiscussion on fine-tuning small LLMs with annotated conversational data including reasoning traces and tool-calling decisions. Author proposes structuring data as samples with full history and loss masking on non-assistant tokens. Asks whether SFT is sufficient or if RL (PPO, GRPO, DPO) is needed to optimize tool use.Read source
Your take?
Summary generated by Claude — human-verified