Back to feed
Reddit r/MachineLearning·

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Signal
35
Hype
15
In three linesDiscussion on fine-tuning small LLMs with annotated conversational data including reasoning traces and tool-calling decisions. Author proposes structuring data as samples with full history and loss masking on non-assistant tokens. Asks whether SFT is sufficient or if RL (PPO, GRPO, DPO) is needed to optimize tool use.
Read source
Your take?
Fine-tuningReasoningReinforcement learningAI Agents

Summary generated by Claude — human-verified