EnterpriseMem-Bench and AgingBench reach the same diagnosis from opposite ends: LLMs degrade as soon as they leave static context. Across 1,400 Text-to-SQL turns, GPT-5 mini and Claude Sonnet lose accuracy from turn 3 onward without explicit working memory. Sonnet 4.6 also regresses 17–33 percentage points on SEC EDGAR relative to Sonnet 4.5, which suggests that benchmark gains do not transfer uniformly to high-symbol-density domains. AgingBench extends this picture across 14 models and ~400 runs: factual reliability degrades even when behavioral tests stay green, through four distinct mechanisms (compression, interference, revision, maintenance). For teams running long-lived agents, this is a direct warning about the limits of offline evaluation metrics.
On the post-training side, Self-Verified Distillation on Qwen3-4B shows that a model can generate, filter, and train on its own data without any external pipeline: +16.7 points on AIME26/HMMT, +11.1 on GPQA Diamond, +8.3 on LCBv5/v6. The method uses cycle-consistency as a quality signal, which makes it applicable in principle to any model with sufficient verification capacity. Read alongside ScientistOne, which reaches zero hallucinations across 75 scientific papers via Chain-of-Evidence, these are two different approaches (distillation vs. traceability) converging on the same goal: reducing factual drift without relying on additional human-labeled data.
SPEAR closes the loop on tooling: an agentic prompt optimizer that integrates a Python sandbox to analyze structural errors (confusion matrices, clustering) rather than relying solely on textual feedback. Across 13 LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad, with a κ of 0.857 vs. 0.359 on tool selection. The +0.79 κ gain on complex judge tasks, attributed to the Python tool alone, indicates that structural error analysis is an underexploited lever in automated prompt optimization.
EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark with 1,400 turns across 300 sessions, evaluates GPT-5 mini, GPT-5.2, Claude Sonnet 4.5/4.6, and Opus 4.6. Key findings: without memory, accuracy collapses by turn 3; plain working memory outperforms more complex memory architectures; Sonnet 4.6 regresses 17–33 percentage points on SEC EDGAR relative to Sonnet 4.5.
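To make "explicit working memory" concrete, here is a minimal Python sketch of session state carried across Text-to-SQL turns and rendered into each new prompt. It is an illustration only, not the benchmark's implementation: the `WorkingMemory` class, its fields, the `build_prompt` helper, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical per-session state; not the structure used by EnterpriseMem-Bench."""
    tables: list[str] = field(default_factory=list)
    filters: dict[str, str] = field(default_factory=dict)
    last_sql: str = ""

    def update(self, tables: list[str], filters: dict[str, str], sql: str) -> None:
        # Accumulate tables without duplicates, preserving first-seen order.
        for table in tables:
            if table not in self.tables:
                self.tables.append(table)
        # Later turns overwrite earlier values for the same filter key.
        self.filters.update(filters)
        self.last_sql = sql

    def render(self) -> str:
        # Serialize the state as plain text so it can be prepended to a prompt.
        tables = ", ".join(self.tables) or "(none)"
        filters = ", ".join(f"{k} = {v}" for k, v in self.filters.items()) or "(none)"
        return "\n".join([
            "Session state carried over from earlier turns:",
            f"tables: {tables}",
            f"filters: {filters}",
            f"previous SQL: {self.last_sql or '(none)'}",
        ])


def build_prompt(memory: WorkingMemory, question: str) -> str:
    """Prefix the new question with the rendered session state."""
    return f"{memory.render()}\n\nQuestion: {question}"


# Illustrative session with made-up table and column names.
memory = WorkingMemory()
memory.update(["filings"], {"company": "ACME"},
              "SELECT * FROM filings WHERE company = 'ACME'")
memory.update(["filings"], {"year": "2023"},
              "SELECT * FROM filings WHERE company = 'ACME' AND year = 2023")
prompt = build_prompt(memory, "And what was the total revenue?")
```

The point of the sketch is that a follow-up such as "And what was the total revenue?" is only answerable if the earlier company and year constraints are still visible to the model at that turn.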
SPEAR is an agentic prompt optimizer that integrates a Python sandbox for structural error analysis (confusion matrices, clustering). Evaluated on 13 industrial LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad (κ 0.857 vs 0.359 on tool selection; F1-macro 0.815 vs 0.763). The Python tool contributes +0.79 κ on complex judge tasks.
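For readers unfamiliar with the κ metric, the sketch below computes Cohen's kappa from a confusion matrix. It assumes that κ in these results denotes Cohen's kappa (agreement between a judge and reference labels, corrected for chance), which the summary does not state explicitly. The function name and the example matrix are illustrative and are not taken from SPEAR.

```python
import math
from typing import Sequence


def cohen_kappa(confusion: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa for a square confusion matrix.

    Rows are reference labels, columns are judge labels.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")

    # Observed agreement: fraction of items on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n

    # Chance agreement: product of the row and column marginals, summed per class.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)

    if expected == 1.0:
        # All items fall in a single class on both sides: kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


# Made-up binary example: 85% raw agreement, 50% expected by chance.
example = [[40, 10],
           [5, 45]]
print(round(cohen_kappa(example), 3))  # 0.7
```

The same matrix is the kind of structural artifact a sandboxed analysis step can inspect: the off-diagonal cells show which label pairs a judge prompt confuses, something a single textual critique does not expose.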
Qwen3 improves its reasoning via Self-Verified Distillation, a post-training algorithm that requires no external data. The model generates solutions, filters them through self-verification (cycle-consistency, factuality, correctness), then trains on the self-curated data. Gains for Qwen3-4B: +16.7 points on math (AIME26/HMMT), +11.1 on science (GPQA Diamond), +8.3 on coding (LCBv5/v6).
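The generate, filter, train loop can be sketched as follows. This is a schematic under stated assumptions and not the paper's algorithm: `self_curate`, `Sample`, and the toy generator and checker are hypothetical stand-ins, and in the actual method both the generation and the verification checks would be performed by the model itself. The training step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    problem: str
    solution: str


Generator = Callable[[str, int], str]   # (problem, attempt index) -> candidate solution
Verifier = Callable[[str, str], bool]   # (problem, solution) -> passes this check?


def self_curate(problems: Iterable[str], generate: Generator,
                verifiers: List[Verifier], n_candidates: int = 4) -> List[Sample]:
    """Keep, per problem, the first candidate that passes every verifier."""
    kept: List[Sample] = []
    for problem in problems:
        for attempt in range(n_candidates):
            solution = generate(problem, attempt)
            if all(check(problem, solution) for check in verifiers):
                kept.append(Sample(problem, solution))
                break  # one accepted sample per problem is enough for this sketch
    return kept


# Toy stand-ins (assumptions): a real setup would call the model in both roles.
def toy_generate(problem: str, attempt: int) -> str:
    a, b = map(int, problem.split("+"))
    # Deliberately wrong on the first attempt so the filter has something to reject.
    return str(a + b + (1 if attempt == 0 else 0))


def toy_correctness(problem: str, solution: str) -> bool:
    a, b = map(int, problem.split("+"))
    return int(solution) == a + b


curated = self_curate(["2+3", "10+7"], toy_generate, [toy_correctness], n_candidates=2)
# `curated` would then serve as the fine-tuning set for the same model.
```

Passing the verifiers as a list mirrors the three checks named above (cycle-consistency, factuality, correctness): each is a separate predicate, and a candidate is kept only if all of them accept it.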
ScientistOne, an autonomous research system, introduces Chain-of-Evidence (CoE) to trace every claim to its source. In an evaluation across 75 papers, baseline systems show 21% hallucinated references and a 42% score-verification pass rate. ScientistOne achieves 0 hallucinations and perfect verification, and matches or exceeds human expert performance on five tasks.
AgingBench, a longitudinal reliability benchmark, measures how deployed AI agents degrade over time. A study across 14 models and ~400 runs shows that reliability depends on four mechanisms: compression, interference, revision, and maintenance aging. Agents lose factual precision even when behavioral tests remain clean.