arXiv cs.CL · 2 June 2026

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

In three lines: A new automated metric, the Triangulated Preference Shift score, measures lexical biases introduced during preference learning (RLHF) in LLMs without manual curation. An analysis across 6 model families reveals a shift toward a 'language of prestige', marked by overuse of words such as 'delve' and 'furthermore'.

Reinforcement learning · Alignment · Evals · Papers

Summary generated by Claude — human-verified
