Label-Free Reinforcement Learning via Cross-Model Entropy
Signal: 78
Hype: 25
In three lines
Cross-Model Entropy (CME) is a proposed label-free reward signal for reinforcement-learning post-training of LLMs.
CME scores each response by its mean log-likelihood under an independent verifier model, which avoids self-consistency signals and reward hacking.
Integrated into GRPO, CME achieves 52.5–71.4% tie-adjusted win rates on UltraFeedback/AlpacaEval 2.0 across Qwen, Llama, Gemma, and OLMo models.
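To make the reward concrete, here is a minimal sketch of how a verifier-based mean log-likelihood could feed GRPO-style group-relative advantages. It is an illustration under assumptions, not the paper's implementation: the summary does not say whether the mean is taken per token, so that is assumed here, and the function names `cme_reward` and `grpo_advantages` and the toy log-probabilities are hypothetical.

```python
from statistics import mean, pstdev


def cme_reward(token_logprobs):
    """Mean log-likelihood of one response under the verifier model.

    Assumption: the mean is taken over the response's tokens.
    """
    if not token_logprobs:
        raise ValueError("response has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data: per-token log-probs that a hypothetical verifier assigns to
# three responses sampled from the policy for the same prompt.
group = [
    [-0.2, -0.4, -0.3],
    [-1.5, -2.0, -1.0, -1.5],
    [-0.9, -0.9],
]

rewards = [cme_reward(lp) for lp in group]
advantages = grpo_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

In a real setup the token log-probabilities would come from a forward pass of the verifier over each prompt and response pair; no human labels or trained reward model enter the computation, which is what makes the signal label-free.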
Summary generated by Claude — human-verified