MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models
In three linesMechELK is a mechanistic interpretability framework for extracting latent knowledge from LLMs. Through three stages (localization via SAE, verification by causal probing, elicitation via representation engineering), it achieves 84.7% accuracy on TruthfulQA, outperforming CCS by 6.2% and identifies 78.3% of hidden knowledge when model output is incorrect.
## MechELK: extracting what the model knows but won't say
### 1. The concrete problem
LLMs routinely produce incorrect or evasive outputs while encoding the correct answer in their internal representations. This gap between internal knowledge and surface output — "latent knowledge" — has been documented since Burns et al. (2022) introduced Contrastive Consistency Search (CCS). CCS's weakness: it relies on contrastive activation patterns that hold for simple factual questions but degrade on multi-step reasoning. MechELK directly targets this failure mode.
### 2. The three-stage architecture
**Locate**: MechELK starts with Sparse Autoencoder (SAE) feature analysis. SAEs, popularized by Anthropic for concept mapping in neural networks, decompose activations into interpretable features. Combined with activation patching — surgically substituting activations between runs to identify causally responsible layers — they produce a map of knowledge-bearing representations.
**Verify**: Causal probing distinguishes genuine latent knowledge from spurious correlations. This is the critical step CCS lacks: without causal verification, a probe may capture a distributional artifact rather than a stable semantic representation. This stage filters false positives.
**Elicit**: Representation engineering (Zou et al., 2023) surfaces hidden knowledge without modifying model weights. Unlike fine-tuning or raw activation addition steering, this approach is non-destructive and applicable at inference time.
### 3. The numbers that matter
- **84.7% average elicitation accuracy** across TruthfulQA + Deceptive Alignment benchmark + Quirky LM dataset - **+6.2% over CCS** (implying ~78.5% for CCS on these benchmarks) - **+9.1% over direct linear probing** (implying ~75.6% for direct linear probing) - **78.3% detection rate** in cases where the surface output is incorrect or evasive
That last figure is the most operationally significant: it quantifies the ability to recover correct knowledge precisely when the model is "lying" or evading. For safety applications, this is the core use case.
### 4. Safety implications and potential losers
**Deceptive alignment**: The dedicated benchmark tested in the paper targets the scenario where a model exhibits aligned behavior during evaluation while encoding divergent internal representations. MechELK detects these divergences with 78.3% recall. Not sufficient for production deployment as a safety certification tool, but an actionable signal for red-teaming workflows.
**What changes versus prior state**: Before MechELK, mechanistic interpretability tools (circuits, SAEs, activation patching) were used to *understand* model behavior, not to *actively extract* hidden knowledge. CCS was the reference for elicitation, but without mechanistic grounding. MechELK is the first framework to unify both branches.
**Potential losers**: - Purely behavioral safety evaluation approaches (prompt-only red-teaming) lose ground to methods that look inside the model - CCS as the reference baseline is directly challenged: -6.2% on the same benchmarks is a meaningful margin - Arguments that LLMs "don't know what they're saying" become harder to sustain: if 78.3% of surface errors hide correct internal knowledge, questions about output accountability shift considerably
**Limitations to flag**: This is a preprint (arXiv:2605.28825v1), not yet peer-reviewed. The benchmarks used — TruthfulQA, Quirky LM — carry known biases. Generalization to very large models (>70B) and non-transformer architectures is not demonstrated. SAE dependency introduces non-trivial computational overhead at inference. Finally, 78.3% detection means 21.7% of cases where correct latent knowledge is not recovered — a non-trivial gap for safety-critical applications.
Summary generated by Claude — human-verified