Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
Signal: 72
Hype: 18
In three lines: Researchers identify preference instability in reward models under subtle input variations (paraphrasing, pattern injection, backdoors). They isolate unstable features using Sparse Autoencoders (SAEs) and propose two mitigation strategies: SAE Feature Steering and SAE Residual Correction, reducing incorrect preference assignments without retraining the model.
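To make the idea of steering SAE features concrete, here is a minimal sketch of generic SAE feature ablation on a single activation vector. It is not the paper's implementation: the summary does not specify how SAE Feature Steering or SAE Residual Correction are computed, so the ReLU encoder, the toy weights, and the names `sae_encode`, `ablate_features`, `unstable`, and `scale` are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(h: Sequence[float], w_enc: Matrix, b_enc: Sequence[float]) -> Vector:
    """Assumed ReLU SAE encoder: f_i = max(0, w_enc[i] . h + b_enc[i])."""
    feats = []
    for row, bias in zip(w_enc, b_enc):
        pre = sum(w * x for w, x in zip(row, h)) + bias
        feats.append(pre if pre > 0.0 else 0.0)
    return feats


def ablate_features(
    h: Sequence[float],
    w_enc: Matrix,
    b_enc: Sequence[float],
    w_dec: Matrix,
    unstable: Sequence[int],
    scale: float = 0.0,
) -> Vector:
    """Damp the decoder contribution of flagged features in activation h.

    Computes h' = h - (1 - scale) * sum_{i in unstable} f_i * w_dec[i].
    scale=0.0 removes the flagged features; scale=1.0 leaves h unchanged.
    Editing h directly (rather than decoding from scratch) keeps whatever
    the SAE fails to reconstruct. No model weights are modified.
    """
    feats = sae_encode(h, w_enc, b_enc)
    out = list(h)
    for i in unstable:
        for j in range(len(out)):
            out[j] -= (1.0 - scale) * feats[i] * w_dec[i][j]
    return out


# Toy 2-feature SAE with identity weights (hypothetical values).
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

h = [2.0, 3.0]
# Assume feature 1 was flagged as unstable by some separate detection step.
print(ablate_features(h, W_ENC, B_ENC, W_DEC, unstable=[1]))  # [2.0, 0.0]
```

In practice the edited activation would be written back into the reward model's forward pass at the layer the SAE was trained on; how the unstable features are detected is outside the scope of this sketch.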
Summary generated by Claude — human-verified