arXiv cs.AI

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Signal: 72 · Hype: 25
In three lines
Distinguishable Deletion (D²) unifies knowledge deletion and refusal for large language model (LLM) unlearning. Instead of targeting specific tokens, the method uses an energy index to erase undesirable knowledge in the model's latent representations, which avoids biased deletion and keeps harmful content from re-emerging. Energy-based Unlearning Alignment (EUA) applies this mechanism during both training and inference.
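The summary does not define D²'s energy index. For orientation only, the sketch below uses the generic free-energy score common in energy-based models, E = -T · logsumexp(s / T), computed over a vector of scores s (for example, outputs of a hypothetical probe on a latent representation). The `route` helper, its `threshold`, and the rule "high energy means refuse" are illustrative assumptions, not the paper's EUA formulation.

```python
import math
from typing import Sequence


def energy_score(scores: Sequence[float], temperature: float = 1.0) -> float:
    """Generic free-energy score: E = -T * logsumexp(scores / T).

    Lower energy means the input looks more "familiar" to the scorer.
    This is an assumed stand-in, not D²'s actual energy index.
    """
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return -temperature * lse


def route(scores: Sequence[float], threshold: float, temperature: float = 1.0) -> str:
    """Hypothetical inference-time rule: treat high-energy (unfamiliar,
    i.e. presumably erased) representations as a signal to refuse."""
    if energy_score(scores, temperature) > threshold:
        return "refuse"
    return "answer"


if __name__ == "__main__":
    confident = [10.0, 0.0]  # one dominant score -> low energy
    diffuse = [0.0, 0.0]     # flat scores -> higher energy
    print(route(confident, threshold=-5.0))  # answer
    print(route(diffuse, threshold=-5.0))    # refuse
```

A single threshold keeps the example small; the paper may calibrate or learn this decision differently.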
AI safety · Alignment · Papers · Reinforcement learning

Summary generated by Claude — human-verified