Probing Persona-Dependent Preferences in Language Models
Signal: 78
Hype: 25
In three lines
Researchers identify a shared preference vector in Gemma-3-27B and Qwen-3.5-122B by training linear probes on residual-stream activations. The vector both predicts and causally controls each model's task choices across personas, including an evil persona, indicating that a largely shared preference representation underlies the different behavioral modes.
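
The probe-then-steer idea in the summary can be illustrated on synthetic data. The sketch below is not the researchers' code: the activations are random stand-ins for residual-stream vectors, the utilities are simulated, and the choice of closed-form ridge regression for the probe and additive steering along the probe direction are assumptions made for illustration. All names (fit_ridge_probe, steer, true_dir) are hypothetical, and personas are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000  # hypothetical hidden size and number of task prompts

# Synthetic stand-in for residual-stream activations (one row per task).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
acts = rng.normal(size=(n, d))

# Simulated task utility: linear in the activations plus a little noise.
utility = acts @ true_dir + 0.1 * rng.normal(size=n)


def fit_ridge_probe(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def steer(x, direction, alpha):
    """Additive steering: shift an activation along a unit direction."""
    return x + alpha * direction


# Fit the probe on a training split and normalise it to a unit direction.
probe = fit_ridge_probe(acts[:1500], utility[:1500])
probe_dir = probe / np.linalg.norm(probe)

# Prediction check: does the probe order held-out task pairs the same way
# the simulated utilities do?
test_acts, test_util = acts[1500:], utility[1500:]
scores = test_acts @ probe
i, j = rng.integers(0, scores.size, size=(2, 1000))
keep = i != j
agree = (scores[i] > scores[j]) == (test_util[i] > test_util[j])
pair_acc = float(np.mean(agree[keep]))

# Recovery check: cosine similarity between probe and planted direction.
cosine = float(probe_dir @ true_dir)

# Steering check: pushing an activation along the probe direction raises
# its probe score by alpha * ||probe||.
steered = steer(test_acts[0], probe_dir, alpha=4.0)
score_shift = float(steered @ probe - test_acts[0] @ probe)

print(f"pairwise accuracy: {pair_acc:.3f}")
print(f"cosine with planted direction: {cosine:.3f}")
print(f"probe-score shift after steering: {score_shift:.3f}")
```

In a real model, the steered activation would be written back into the forward pass and the downstream choice measured. Here only the probe score is checked, which is a much weaker test than the causal intervention the summary describes.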
Summary generated by Claude — human-verified