arXiv cs.LG

Label-Free Reinforcement Learning via Cross-Model Entropy

Signal: 78
Hype: 25
In three lines
Cross-Model Entropy (CME) is a proposed label-free reward signal for RL post-training of LLMs. It scores each response by its mean log-likelihood under an independent verifier model, which avoids self-consistency rewards and reward hacking. Integrated into GRPO, CME achieves 52.5–71.4% tie-adjusted win rates on UltraFeedback/AlpacaEval 2.0 across Qwen, Llama, Gemma, and OLMo models.
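As a rough illustration of the idea summarized here, the sketch below turns per-token log-probabilities from a verifier model into a mean log-likelihood reward, then converts the rewards for one prompt's sampled responses into GRPO-style group-normalized advantages. The function names, the toy log-probabilities, and the plain mean/standard-deviation normalization are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List


def mean_log_likelihood(token_logprobs: List[float]) -> float:
    """Mean per-token log-likelihood of one response under the verifier.

    `token_logprobs` is assumed to hold log p(token | prefix) for each
    response token, as scored by a model independent of the policy.
    """
    if not token_logprobs:
        raise ValueError("response has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier log-probs for three responses to the same prompt.
group = [
    [-0.2, -0.4, -0.3],
    [-1.5, -2.0, -1.0, -1.5],
    [-0.8, -0.9],
]

rewards = [mean_log_likelihood(lp) for lp in group]
advantages = group_relative_advantages(rewards)
```

Averaging over tokens rather than summing keeps the reward from simply favoring shorter responses, and no human labels or reference answers enter the computation at any point.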
Reinforcement learning, Llama, Qwen, Reasoning, Papers

Summary generated by Claude — human-verified