Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Signal: 72
Hype: 15
In three lines
A study of controlled removal of safety alignment from language models, done to evaluate their cybersecurity capabilities. It compares three approaches: authorized-context prompting, refusal-direction projection, and LoRA-based de-alignment. On 60 tasks (Security-AR), task-only LoRA reaches a security score of 0.87 with a general capability score of 0.83, but it also increases unsafe compliance on out-of-scope requests.
Summary generated by Claude — human-verified