arXiv cs.LG

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Signal: 82 · Hype: 15
In three lines: A multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) of linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥ 0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.
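For readers unfamiliar with the setup, here is a minimal sketch of what "a linear probe scored by AUC at each layer" can look like. It is not the paper's code or data: the activations are synthetic Gaussians, the per-layer separations in `LAYER_SHIFTS` are made-up values chosen only for illustration, and the probe is a simple difference-of-means direction rather than whatever classifier the authors trained. All function names are hypothetical.

```python
import random

def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for one layer's activations.
    Label 1 ("deceptive") examples are shifted by `shift` along axis 0."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # alternate labels so classes are balanced
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * y
        xs.append(x)
        ys.append(y)
    return xs, ys

def fit_mean_difference_probe(xs, ys):
    """Linear probe direction: mean(class 1) minus mean(class 0)."""
    dim = len(xs[0])
    sums = [[0.0] * dim, [0.0] * dim]
    counts = [0, 0]
    for x, y in zip(xs, ys):
        counts[y] += 1
        for j, v in enumerate(x):
            sums[y][j] += v
    return [sums[1][j] / counts[1] - sums[0][j] / counts[0] for j in range(dim)]

def project(xs, w):
    """Score each example by its dot product with the probe direction."""
    return [sum(a * b for a, b in zip(x, w)) for x in xs]

def auc(scores, ys):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def probe_auc_by_layer(shifts, n_train=200, n_test=200, dim=16, seed=0):
    """Fit a probe on a train split and report held-out AUC for each 'layer'."""
    rng = random.Random(seed)
    results = []
    for shift in shifts:
        train_x, train_y = make_synthetic_activations(n_train, dim, shift, rng)
        test_x, test_y = make_synthetic_activations(n_test, dim, shift, rng)
        w = fit_mean_difference_probe(train_x, train_y)
        results.append(auc(project(test_x, w), test_y))
    return results

# Assumption: illustrative separations only, not values from the paper.
LAYER_SHIFTS = [0.0, 6.0]
aucs = probe_auc_by_layer(LAYER_SHIFTS)
for layer, value in enumerate(aucs):
    print(f"layer {layer}: held-out AUC = {value:.3f}")
```

With real models, the synthetic generator would be replaced by activations cached at each layer for honest and deceptive prompts. The held-out split matters: a probe scored on its own training data can look strong even when there is no signal.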
Tags: Papers, AI safety, Alignment, Evals, Reasoning

Summary generated by Claude — human-verified