arXiv cs.LG·19 May 2026

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

In three lines: Researchers identify preference instability in reward models under subtle input variations (paraphrasing, pattern injection, backdoors). They isolate unstable features using Sparse Autoencoders (SAEs) and propose two mitigation strategies, SAE Feature Steering and SAE Residual Correction, which reduce incorrect preference assignments without retraining the model.
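To make the idea of steering SAE features concrete, here is a minimal sketch of the generic technique. It is not the paper's implementation: the summary does not say how SAE Feature Steering or SAE Residual Correction are defined, so the ReLU encoder, the linear decoder, the choice to scale flagged features, and every name below (`steer`, `unstable`, the toy weights) are assumptions made for illustration only.

```python
from typing import List, Sequence, Set

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Assumed ReLU encoder: f = max(0, W_enc x + b_enc)."""
    return [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Assumed linear decoder: x_hat = W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def steer(x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix,
          b_dec: Vector, unstable: Set[int], scale: float = 0.0) -> Vector:
    """Scale the flagged SAE features, then map back to activation space.

    The SAE reconstruction error (x - x_hat) is added back, so only the
    flagged features change and whatever the SAE fails to capture is kept.
    """
    f = sae_encode(x, w_enc, b_enc)
    x_hat = sae_decode(f, w_dec, b_dec)
    error = [a - b for a, b in zip(x, x_hat)]
    f_steered = [v * scale if i in unstable else v for i, v in enumerate(f)]
    x_steered = sae_decode(f_steered, w_dec, b_dec)
    return [a + e for a, e in zip(x_steered, error)]


# Hypothetical toy SAE: 2-dimensional activation, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
B_DEC = [0.0, 0.0]

activation = [2.0, 1.0]
unchanged = steer(activation, W_ENC, B_ENC, W_DEC, B_DEC, unstable=set())
ablated = steer(activation, W_ENC, B_ENC, W_DEC, B_DEC, unstable={2})
print(unchanged, ablated)
```

With no features flagged, the activation passes through unchanged, which is the property that lets an intervention of this kind run at inference time without retraining the reward model. Which features count as unstable, and at which layer the intervention is applied, are questions the summary leaves to the paper.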

Alignment AI safety Evals

Summary generated by Claude — human-verified
