arXiv cs.AI

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Signal: 72
Hype: 18
In three lines: This study tests whether off-the-shelf persona vectors can reduce sycophancy, meaning a model agreeing with the user even when the user is wrong. Vectors that steer toward doubt or scrutiny achieve 68–98% of the sycophancy reduction obtained with CAA (Contrastive Activation Addition) while maintaining accuracy. The takeaway is that sycophancy is a persona-level property, not a single steerable direction.
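
For readers unfamiliar with CAA, the sketch below shows the general idea of contrastive activation steering: take the difference between mean activations on two contrasting prompt sets and add a scaled copy of it to a hidden state. It is a minimal illustration of the generic technique, not the paper's implementation. The function names, the toy 3-dimensional activations, and the `alpha` value are all assumptions made up for this example.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]

def steer(hidden, vector, alpha):
    """Add alpha * vector to a hidden state; a negative alpha steers away."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations, invented for illustration only (not data from the paper).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
non_sycophantic = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

v = steering_vector(sycophantic, non_sycophantic)
steered = steer([1.0, 1.0, 1.0], v, alpha=-0.5)
```

Under the same assumptions, steering with an off-the-shelf persona vector would reuse `steer` and simply pass a precomputed vector (for example, one associated with doubt or scrutiny) in place of `v`.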
Alignment, AI safety, Evals

Summary generated by Claude — human-verified