Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Signal: 78
Hype: 25
In three lines
Large reasoning models (LRMs) encode refusal jointly in residual-stream activations and in the chain-of-thought (CoT). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in 39% of cases when the CoT is held fixed, versus 70% without CoT. Regenerating the CoT under steering reaches 94% success, indicating that refusal is distributed across both activations and CoT.
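The summary does not say how the steering vector is built or applied. As a rough illustration only, the sketch below assumes one common recipe for single-direction steering: take the difference of mean activations between harmful and harmless prompts as a "refusal direction", then project that direction out of an activation. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means direction; the paper may use another method.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy 3-dimensional "residual stream" activations (made-up numbers).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(harmful, harmless)
steered = ablate([3.0, 1.0, 5.0], r)
print(r, steered)
```

In a real model the same projection would be applied to hidden states during generation. The reported gap between 39% (fixed CoT) and 94% (regenerated CoT) suggests that editing activations alone leaves part of the refusal signal in the already-written reasoning text.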
Summary generated by Claude — human-verified