Reddit r/LocalLLaMA

Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

Signal: 72 · Hype: 25
In three lines: A GRPO fine-tuning study on tiny models (Qwen2.5-0.5B, LFM-2.5-350M) for Reddit post summarization constrained to exactly 64 tokens. It compares staged training (length first, then quality) with joint training. The staged curriculum wins, with G-Eval scores of 2.904 (LFM) and 2.817 (Qwen) versus zero-shot baselines of 2.376/2.332.
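
For readers unfamiliar with the setup, here is a minimal Python sketch of how a length-constrained reward and a staged versus joint schedule might be wired into GRPO-style training. Everything in it is an assumption made for illustration: the linear length penalty, the stage names, the 50/50 joint weighting, the length gating in the second stage, and the helper names `length_reward`, `combined_reward`, and `group_relative_advantages` are hypothetical and not taken from the post. Only the 64-token target and the "length first, then quality" ordering come from the summary. The advantage function standardizes rewards within one sampled group, which is the core idea of GRPO; real implementations differ on details such as which standard deviation estimator they use.

```python
from statistics import mean, pstdev

TARGET_TOKENS = 64  # length target stated in the summary


def length_reward(n_tokens: int, target: int = TARGET_TOKENS) -> float:
    """Hypothetical shaping: 1.0 at the target length, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(n_tokens - target) / target)


def combined_reward(n_tokens: int, quality: float, stage: str) -> float:
    """Hypothetical schedule; `quality` is assumed to be pre-scaled to [0, 1]."""
    r_len = length_reward(n_tokens)
    if stage == "length":    # staged training, phase 1: length only
        return r_len
    if stage == "quality":   # staged training, phase 2: quality gated by length (assumed)
        return r_len * quality
    if stage == "joint":     # joint training: both signals from the first step
        return 0.5 * r_len + 0.5 * quality
    raise ValueError(f"unknown stage: {stage}")


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, estimators vary
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled summaries for one post, with made-up token counts.
    rewards = [combined_reward(n, quality=0.0, stage="length") for n in (64, 48, 120)]
    print(group_relative_advantages(rewards))
```

In this sketch a staged run would call `combined_reward` with `stage="length"` until summaries reliably land near 64 tokens, then switch to `stage="quality"`; a joint run would use `stage="joint"` throughout.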
Qwen · Fine-tuning · Reinforcement learning · Evals · Open source

Summary generated by Claude — human-verified