arXiv cs.LG

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

Signal: 72
Hype: 18
In three lines: Researchers identify preference instability in reward models under subtle input variations (paraphrasing, pattern injection, backdoors). They isolate unstable features using Sparse Autoencoders (SAEs) and propose two mitigation strategies: SAE Feature Steering and SAE Residual Correction, reducing incorrect preference assignments without retraining the model.
Tags: Alignment, AI safety, Evals

Summary generated by Claude — human-verified