Edition of 2026-05-27

Memory, self-distillation, and agent aging: three angles on LLM reliability in production

EnterpriseMem-Bench and AgingBench reach the same diagnosis from opposite ends: LLMs degrade once they operate beyond a static context. Across 1,400 Text-to-SQL turns, GPT-5 mini and Claude Sonnet lose accuracy from turn 3 onward when they lack explicit working memory. Sonnet 4.6 also regresses 17–33 percentage points on SEC EDGAR relative to Sonnet 4.5, which suggests that benchmark gains don't transfer uniformly to high-symbol-density domains. AgingBench extends this picture across 14 models and ~400 runs: factual reliability degrades even when behavioral tests stay green, through four distinct mechanisms (compression, interference, revision, maintenance). For teams running long-lived agents, this is a direct warning about the limits of offline evaluation metrics.
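One way to surface this kind of decay in your own evaluation is to break accuracy down by turn instead of reporting a single aggregate. The snippet below is a minimal sketch under that assumption. The function name `accuracy_by_turn` and the `(turn, is_correct)` record shape are illustrative choices, not part of either benchmark, and the sample data is a toy, not benchmark output.

```python
from collections import defaultdict

def accuracy_by_turn(results):
    """Return {turn_index: accuracy} from (turn_index, is_correct) pairs.

    Hypothetical helper: the record format is an assumption made for
    illustration, not the format used by EnterpriseMem-Bench or AgingBench.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for turn, ok in results:
        totals[turn] += 1
        correct[turn] += int(ok)
    # Sort by turn so a decline over the conversation is easy to read off.
    return {turn: correct[turn] / totals[turn] for turn in sorted(totals)}

# Toy records (made up): early turns are correct, later turns are not.
toy_results = [(1, True), (1, True), (2, True), (2, True),
               (3, True), (3, False), (4, False), (4, False)]
per_turn = accuracy_by_turn(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
print(per_turn, overall)
```

In this toy data the overall accuracy looks moderate while the per-turn view shows it collapsing in later turns, which is the pattern an aggregate offline metric can hide.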

On the post-training side, Self-Verified Distillation on Qwen3-4B shows that a model can generate, filter, and train on its own data without any external pipeline: +16.7 points on AIME26/HMMT, +11.1 on GPQA Diamond, +8.3 on LCBv5/v6. The method uses cycle-consistency as a quality signal, making it applicable to any model with sufficient verification capacity. Read alongside ScientistOne, which reports 0 hallucinations across 75 scientific papers via Chain-of-Evidence, the two take different approaches (distillation vs. traceability) to the same goal: reducing factual drift without relying on additional human-labeled data.
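This summary does not spell out how the cycle-consistency check is implemented, so the following is only a generic sketch of the idea: map a prompt forward to an answer, map the answer back, and keep the pair only if the round trip agrees with the original. The names `cycle_consistent_filter`, `forward`, `backward`, and `agree` are all hypothetical, and the string functions stand in for model calls purely so that the example runs.

```python
from typing import Callable, Iterable, List, Tuple

def cycle_consistent_filter(
    prompts: Iterable[str],
    forward: Callable[[str], str],      # assumed: model produces an answer
    backward: Callable[[str], str],     # assumed: model reconstructs the prompt
    agree: Callable[[str, str], bool],  # assumed: round-trip agreement check
) -> List[Tuple[str, str]]:
    """Keep (prompt, answer) pairs whose round trip agrees with the prompt."""
    kept = []
    for prompt in prompts:
        answer = forward(prompt)
        recovered = backward(answer)
        if agree(prompt, recovered):
            kept.append((prompt, answer))
    return kept

# Toy stand-ins for model calls: upper-casing loses case information,
# so only prompts that were already lowercase survive the round trip.
kept_pairs = cycle_consistent_filter(
    prompts=["abc", "Abc"],
    forward=str.upper,
    backward=str.lower,
    agree=lambda original, recovered: original == recovered,
)
print(kept_pairs)
```

In a real setup the surviving pairs would become the training set for the next round. How strict `agree` should be is a design decision that this sketch leaves open.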

SPEAR closes the loop on tooling: an agentic prompt optimizer that integrates a Python sandbox to analyze structural errors (confusion matrices, clustering) rather than relying solely on textual feedback. Across 13 LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad with a κ of 0.857 vs. 0.359 on tool selection. The +0.79 κ gain attributed to the Python tool alone confirms that structural error analysis is an underexploited lever in automated prompt optimization.
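For readers less familiar with the metric, κ here presumably refers to Cohen's kappa, which corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it alongside a confusion matrix for two label sequences. It illustrates the standard formula only and is not SPEAR's code. The function names and the toy judge/reference labels are assumptions.

```python
from collections import Counter

def confusion_matrix(labels_a, labels_b):
    """Count how often each (label_a, label_b) pair occurs."""
    return Counter(zip(labels_a, labels_b))

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels (made up): an LLM judge vs. a reference rater.
judge = ["pass", "pass", "fail", "fail"]
reference = ["pass", "fail", "fail", "fail"]
matrix = confusion_matrix(judge, reference)
kappa = cohens_kappa(judge, reference)
print(matrix, kappa)
```

The confusion matrix shows where the disagreement sits (here a single "pass" the reference marked "fail"), which is the kind of structural signal a sandboxed analysis step can use and plain textual feedback cannot.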

Signal IA