arXiv cs.LG·17 June 2026

Rift: A Conflict Signature for Deception in Language Models


In three lines

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. The signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).


AI safety Alignment Evals Papers

Summary generated by Claude — human-verified
