arXiv cs.AI

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Signal: 78
Hype: 25
In three lines: Large reasoning models (LRMs) encode refusal jointly in residual-stream activations and in the chain-of-thought (CoT). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in 39% of cases when the CoT is held fixed, versus 70% when there is no CoT. Regenerating the CoT under steering reaches 94% success, which indicates that refusal is distributed across both the activations and the CoT.
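For readers unfamiliar with activation steering, here is a minimal sketch of the general idea on toy vectors. It assumes the steering direction is a unit-normalized difference of mean activations between harmful and harmless prompts, which is a common choice in the refusal-direction literature but is not confirmed by this summary as the paper's exact method. All function names and numbers are hypothetical and purely illustrative.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean over a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """ASSUMPTION: direction = normalized difference of class means."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def ablate(activation: Vector, direction: Vector) -> Vector:
    """Remove the component of the activation along the (unit) direction."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift the activation by alpha along the direction (negative alpha pushes away)."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 2-D "residual stream" activations (made-up numbers, not from the paper).
harmful_acts = [[2.0, 1.0], [4.0, 1.0]]
harmless_acts = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(harmful_acts, harmless_acts)
ablated = ablate([5.0, 2.0], direction)
steered = steer(ablated, direction, alpha=-1.5)
```

In a real model the same arithmetic would be applied to hidden states at chosen layers during generation. The summary's point is that such an intervention alone is less effective when a refusal-leaning CoT is already fixed in the context.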
Tags: Reasoning, AI safety, Alignment, DeepSeek

Summary generated by Claude — human-verified