arXiv cs.AI·27 May 2026

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

In three lines

Large reasoning models (LRMs) encode refusal jointly in residual-stream activations and in the chain-of-thought (CoT). On DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in 39% of cases when the CoT is held fixed, versus 70% when no CoT is present. Regenerating the CoT under steering raises the success rate to 94%, indicating that refusal is distributed across both activations and CoT.
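For readers unfamiliar with activation steering, here is a minimal sketch of the general idea on toy vectors. It is not the paper's code. The summary does not say how the steering direction is obtained, so the difference-of-means estimate used here is an assumption, as are all function names and the synthetic "activations".

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: unit-norm difference of the two class means."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def steer(act, direction, alpha):
    """Shift one activation vector by alpha along a unit direction."""
    return [a + alpha * d for a, d in zip(act, direction)]

def ablate(act, direction):
    """Remove the component of the activation along the unit direction."""
    return steer(act, direction, -dot(act, direction))

# Synthetic stand-ins for residual-stream activations (hypothetical data):
# "harmful" prompts are offset along axis 0, "harmless" ones are not.
random.seed(0)
D_MODEL = 8
harmless = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(200)]
harmful = [[random.gauss(0, 1) + (3.0 if i == 0 else 0.0)
            for i in range(D_MODEL)] for _ in range(200)]

direction = refusal_direction(harmful, harmless)
ablated = ablate(harmful[0], direction)
```

In a real model the same shift would be applied to hidden states during the forward pass. The results summarized above suggest that such a shift is less effective when a previously generated CoT is kept fixed than when the CoT is regenerated under steering.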

Reasoning AI safety Alignment DeepSeek

Summary generated by Claude — human-verified
