arXiv cs.CL

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Signal: 72
Hype: 18
In three lines: A new automated metric, the Triangulated Preference Shift score, measures lexical biases introduced during preference learning (RLHF) in LLMs without manual curation. Analysis across 6 model families reveals a shift toward a 'language of prestige' (overuse of words such as 'delve' and 'furthermore').
Reinforcement learning · Alignment · Evals · Papers

Summary generated by Claude — human-verified