arXiv cs.CL

General Preference Reinforcement Learning

Signal: 75
Hype: 25
In three lines
GPRL (General Preference Reinforcement Learning) replaces scalar reward models with a General Preference Model (GPM) built on k skew-symmetric subspaces. Tested on Llama-3-8B-Instruct, it reaches a 56.51% win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by preventing single-axis reward hacking.
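
To make "k skew-symmetric subspaces" concrete, here is a minimal sketch of how a skew-symmetric preference score can behave. It is an illustration only, not the paper's implementation: the block form `[[0, -1], [1, 0]]`, the 2k-dimensional response embeddings, the sigmoid link, and the function names `skew_block_operator`, `preference_score`, and `preference_prob` are all assumptions that the summary above does not confirm. The point it shows is that such a score is antisymmetric and can represent a cyclic preference (a beats b, b beats c, c beats a), which no single scalar reward can.

```python
import numpy as np


def skew_block_operator(k: int) -> np.ndarray:
    """Build a (2k x 2k) block-diagonal matrix with k skew-symmetric 2x2 blocks.

    Assumed form for illustration; the paper may parameterize this differently.
    """
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(k), block)


def preference_score(v_i: np.ndarray, v_j: np.ndarray, R: np.ndarray) -> float:
    """Score for 'response i is preferred over response j'.

    Because R.T == -R, swapping the arguments flips the sign.
    """
    return float(v_i @ R @ v_j)


def preference_prob(v_i: np.ndarray, v_j: np.ndarray, R: np.ndarray) -> float:
    """Map the score to a probability with a sigmoid (assumed link function)."""
    return float(1.0 / (1.0 + np.exp(-preference_score(v_i, v_j, R))))


# Hypothetical embeddings: three responses placed 120 degrees apart in one
# 2-D subspace (k = 1). With this R the score equals sin(angle_i - angle_j).
R = skew_block_operator(k=1)
angles = np.deg2rad([0.0, -120.0, -240.0])
a, b, c = (np.array([np.cos(t), np.sin(t)]) for t in angles)

cycle_scores = [
    preference_score(a, b, R),  # a over b
    preference_score(b, c, R),  # b over c
    preference_score(c, a, R),  # c over a
]
```

All three entries of `cycle_scores` are positive, so the preferences form a cycle. A scalar reward would need r(a) > r(b) > r(c) > r(a), which is impossible.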
Reinforcement learning · Llama · Alignment · Benchmarks

Summary generated by Claude — human-verified