arXiv cs.CL·26 May 2026

Measuring the Depth of LLM Unlearning via Activation Patching


In three lines: UDS (Unlearning Depth Score) is a new metric for evaluating whether knowledge has truly been erased from an LLM. It uses activation patching to measure the mechanistic depth of unlearning layer by layer. In an evaluation covering 150 models and 8 methods, UDS outperforms 20 existing metrics on faithfulness and robustness.
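For readers unfamiliar with activation patching, here is a minimal sketch of the general idea on a toy network. It is not the paper's UDS formula, which this summary does not give. The toy tanh MLP, the helper names (`forward`, `patching_profile`), the patching direction (original model into edited model), and the "recovered fraction" score are all assumptions made for illustration.

```python
import math
import random


def layer(W, h):
    # One toy layer: matrix-vector product followed by tanh.
    return [math.tanh(sum(w * v for w, v in zip(row, h))) for row in W]


def forward(weights, x, patch=None):
    # Run the toy MLP. If patch=(i, activation) is given, overwrite the
    # output of layer i with that activation before continuing.
    h = list(x)
    acts = []
    for i, W in enumerate(weights):
        h = layer(W, h)
        if patch is not None and patch[0] == i:
            h = list(patch[1])
        acts.append(h)
    return h, acts


def distance(a, b):
    # Euclidean distance between two output vectors.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def patching_profile(source, target, x):
    # For each layer, copy the source model's activation into the target
    # model and report how much of the output gap that closes
    # (0.0 = nothing recovered, 1.0 = source output fully recovered).
    # Hypothetical score for illustration only, not the UDS definition.
    src_out, src_acts = forward(source, x)
    tgt_out, _ = forward(target, x)
    base_gap = distance(src_out, tgt_out)
    profile = []
    for i in range(len(target)):
        patched_out, _ = forward(target, x, patch=(i, src_acts[i]))
        if base_gap > 0:
            profile.append(1.0 - distance(src_out, patched_out) / base_gap)
        else:
            profile.append(0.0)
    return profile


def rand_matrix(n):
    return [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]


random.seed(0)
# Stand-ins for an original model and an edited ("unlearned") copy of it.
original = [rand_matrix(4) for _ in range(3)]
edited = [[[w + 0.1 * random.gauss(0, 1) for w in row] for row in W]
          for W in original]
x = [random.gauss(0, 1) for _ in range(4)]

profile = patching_profile(original, edited, x)
print(profile)  # one recovery value per layer
```

In this toy setup, patching the final layer always recovers the source output exactly, so the last entry is 1.0. The informative part of such a profile is how the earlier layers behave, which is the kind of layer-by-layer signal the summary describes.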


AI safety Alignment Evals Papers

Summary generated by Claude — human-verified
