arXiv cs.AI

Asking Back: Interaction-Layer Antidistillation Watermarks

Signal: 78 · Hype: 15
In three lines: A new watermarking approach against unauthorized LLM distillation injects behavioral markers (follow-up questions, low-frequency variants, restatements) via the system prompt. It was tested on 63 LoRA-distilled models derived from Llama-3.3-70B, with transfer rates ranging from 88.9% (Gemma) down to 45.2% (Qwen). Robustness was validated against DIPPER paraphrasing, and a user study (N=20) confirmed that the markers are imperceptible.
AI safety · Alignment · Llama · Benchmarks · Papers

Summary generated by Claude — human-verified