General Preference Reinforcement Learning
Signal: 75
Hype: 25
In three lines
GPRL (General Preference Reinforcement Learning) is a new method that replaces scalar reward models with a General Preference Model (GPM) built on k skew-symmetric subspaces. Tested on Llama-3-8B-Instruct, it reaches a 56.51% win rate on AlpacaEval 2.0. It outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by preventing reward hacking along a single axis.
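To make "k skew-symmetric subspaces" concrete, here is a minimal sketch of one way such a preference score can be built. The summary does not give the exact parameterization, so everything here is an assumption for illustration: each subspace is taken to be 2-dimensional with a 90-degree rotation block, responses are represented by hand-picked 2k-dimensional vectors rather than model outputs, and the names `skew_block_operator`, `preference_score`, and `preference_prob` are hypothetical.

```python
import numpy as np


def skew_block_operator(k):
    """Return a 2k x 2k block-diagonal matrix made of k copies of [[0, -1], [1, 0]].

    Assumption: one 2-D rotation block per subspace. The result R satisfies R.T == -R.
    """
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(k), block)


def preference_score(v_i, v_j, R):
    """Score for 'response i is preferred over response j'.

    Because R is skew-symmetric, swapping the arguments flips the sign.
    """
    return float(v_i @ R @ v_j)


def preference_prob(v_i, v_j, R):
    """Map the score to a probability with a logistic function."""
    return 1.0 / (1.0 + np.exp(-preference_score(v_i, v_j, R)))


def unit(theta):
    """Unit vector in the plane at angle theta (radians)."""
    return np.array([np.cos(theta), np.sin(theta)])


# Toy example with k = 1: three hand-picked embeddings spaced 120 degrees apart.
R1 = skew_block_operator(1)
a, b, c = (unit(t) for t in np.deg2rad([0.0, -120.0, -240.0]))

for name, (x, y) in {"a over b": (a, b), "b over c": (b, c), "c over a": (c, a)}.items():
    print(name, round(preference_prob(x, y, R1), 3))
```

In this toy setup a is preferred to b, b to c, and c to a. No single scalar reward can encode such a cycle, since it would require r(a) > r(b) > r(c) > r(a). That is the sense in which a pairwise, multi-subspace score is less tied to one axis than a scalar reward model.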
Summary generated by Claude — human-verified