Measuring the Depth of LLM Unlearning via Activation Patching
Signal: 78
Hype: 18
In three lines
A new metric, the Unlearning Depth Score (UDS), evaluates whether knowledge has truly been erased from an LLM. Using activation patching, UDS measures the mechanistic depth of unlearning layer by layer. In an evaluation covering 150 models and 8 methods, UDS outperforms 20 existing metrics on faithfulness and robustness.
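To make the layer-by-layer idea concrete, here is a minimal sketch of activation patching on a toy residual network in NumPy. It is not the UDS formula from the source, which this summary does not give. The network, the function names `forward` and `patching_profile`, the `readout` vector, and the patch direction (unlearned activations copied into the original model) are all assumptions chosen for illustration.

```python
import numpy as np

def forward(weights, x, patch_layer=None, patch_value=None):
    """Run a toy residual network and cache the hidden state after each layer.

    If patch_layer is set, the hidden state after that layer is overwritten
    with patch_value before the remaining layers run.
    """
    h = x
    cache = []
    for i, w in enumerate(weights):
        h = h + np.tanh(h @ w)
        if i == patch_layer:
            h = patch_value
        cache.append(h)
    return h, cache

def patching_profile(original, unlearned, x, readout):
    """Patch the unlearned model's hidden state into the original model, one layer at a time.

    Returns the original model's unpatched score and the per-layer patched scores.
    A patched score close to the baseline suggests that the patched-in activation
    still carries the information the original model needs at that layer.
    """
    out_orig, _ = forward(original, x)
    _, cache_unl = forward(unlearned, x)
    base = float(out_orig @ readout)
    scores = []
    for layer in range(len(original)):
        out, _ = forward(original, x, patch_layer=layer, patch_value=cache_unl[layer])
        scores.append(float(out @ readout))
    return base, scores

# Hypothetical setup: "unlearning" that only changed the last of four layers.
rng = np.random.default_rng(0)
original = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(4)]
unlearned = [w.copy() for w in original]
unlearned[3] = rng.normal(scale=0.5, size=(4, 4))

x = rng.normal(size=4)        # stand-in for a prompt representation
readout = rng.normal(size=4)  # stand-in for the target-answer direction

base, scores = patching_profile(original, unlearned, x, readout)
print("baseline score:", base)
print("patched scores by layer:", scores)
```

In this toy setup the first three layers are untouched, so patching at those layers reproduces the baseline score exactly, and only the final layer shows a change. A profile like that would indicate shallow unlearning: the early representations are unchanged and only the output stage was altered.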
Summary generated by Claude — human-verified