arXiv cs.CL

Measuring the Depth of LLM Unlearning via Activation Patching

Signal: 78
Hype: 18
In three lines: The paper introduces UDS (Unlearning Depth Score), a metric for evaluating whether knowledge has truly been erased from an LLM. Using activation patching, UDS measures the mechanistic depth of unlearning layer by layer. In an evaluation covering 150 models and 8 methods, UDS outperforms 20 existing metrics on faithfulness and robustness.
AI safety · Alignment · Evals · Papers

Summary generated by Claude — human-verified