Rift: A Conflict Signature for Deception in Language Models
Signal
82
Hype
25
In three linesResearchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).Read source
Your take?
Summary generated by Claude — human-verified