Reddit r/MachineLearning·1 June 2026

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Signal

Hype

In three linesDiscussion on fine-tuning small LLMs with annotated conversational data including reasoning traces and tool-calling decisions. Author proposes structuring data as samples with full history and loss masking on non-assistant tokens. Asks whether SFT is sufficient or if RL (PPO, GRPO, DPO) is needed to optimize tool use.

Read source

Your take?

Fine-tuning Reasoning Reinforcement learning AI Agents

Summary generated by Claude — human-verified

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Other angles on this story