Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
Signal: 72
Hype: 18
In three lines
A new automated metric, the Triangulated Preference Shift score, measures lexical biases introduced during preference learning (RLHF) in LLMs without manual curation. Analysis across 6 model families reveals a shift toward a 'language of prestige' (overuse of 'delve', 'furthermore').
Summary generated by Claude — human-verified