arXiv cs.LG · 1 June 2026

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

In three lines: A multi-model study (Pythia-1.4B, Gemma-2, Qwen2.5-7B, Llama-3.1-8B) of linear representations of synthetic dishonesty. Linear probes detect deception with AUC ≥ 0.99 as early as layers 1-3. Dishonesty representations consolidate progressively in deeper layers, with implications for activation-based monitoring.
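To make the probing idea concrete, here is a minimal sketch of how a linear probe and its AUC can be computed. It is not the paper's pipeline: the activations are synthetic Gaussian vectors, the probe is a simple difference-of-means direction rather than whatever classifier the authors trained, and every name (`make_activations`, `fit_mean_difference_probe`, `layer_auc`, the `shift` parameter) is hypothetical. The `shift` values are made up to show how AUC tracks linear separability; they do not reproduce any layer-wise result from the study.

```python
import random


def make_activations(n, dim, shift, seed):
    """Synthetic stand-in for hidden activations (assumption, not real model data).

    Examples labelled 1 ("deceptive") are shifted along the first axis by `shift`.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # alternate labels so classes are balanced
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * y
        xs.append(x)
        ys.append(y)
    return xs, ys


def fit_mean_difference_probe(xs, ys):
    """Probe direction = mean(class 1) - mean(class 0); a simple linear probe."""
    dim = len(xs[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in zip(xs, ys):
        counts[y] += 1
        for j, v in enumerate(x):
            sums[y][j] += v
    return [sums[1][j] / counts[1] - sums[0][j] / counts[0] for j in range(dim)]


def probe_scores(w, xs):
    """Project each activation vector onto the probe direction."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]


def auc(scores, ys):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def layer_auc(shift, seed=0, n_train=400, n_test=400, dim=16):
    """Fit the probe on one synthetic sample and report AUC on a held-out one."""
    train_x, train_y = make_activations(n_train, dim, shift, seed)
    test_x, test_y = make_activations(n_test, dim, shift, seed + 1)
    w = fit_mean_difference_probe(train_x, train_y)
    return auc(probe_scores(w, test_x), test_y)


if __name__ == "__main__":
    # Illustrative separations only; not tied to any real layer.
    for shift in [0.0, 1.0, 3.0, 6.0]:
        print(f"shift {shift}: held-out AUC = {layer_auc(shift):.3f}")
```

In a real setting, the synthetic vectors would be replaced by activations extracted from a chosen layer of the model under study, and the probe would be evaluated on held-out prompts in the same way.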

Summary generated by Claude — human-verified
