arXiv cs.AI · 19 May 2026

Asking Back: Interaction-Layer Antidistillation Watermarks

In three lines

A new watermarking approach against unauthorized LLM distillation injects behavioral markers (follow-up questions, low-frequency variants, restatements) via the system prompt. It was tested on 63 models LoRA-distilled from Llama-3.3-70B, with transfer rates ranging from 45.2% (Qwen) to 88.9% (Gemma). Robustness was validated against DIPPER paraphrasing, and a user study (N=20) confirmed imperceptibility.

Summary generated by Claude — human-verified
