Edition of 2026-06-10

Context compression and reasoning evaluation: two structural research axes on June 10

Three of today's five papers attack the same problem from different angles: reducing active context without degrading accuracy. IntentKV compresses the KV cache of multi-turn agents from 92.3k to 20.5k tokens (−77.8%) on Qwen2.5-14B by scoring historical tokens through a cross-turn intent memory. Engram goes further and improves accuracy while shrinking context: by retrieving ~9.6k tokens via a bi-temporal knowledge graph, it reaches 83.6% on LongMemEval_S versus 73.2% for the full history, which means 8x fewer tokens for +10.4 accuracy points. Prefilling-dLLM applies the same logic to diffusion language models, achieving a 9.1–28x speedup on 8K–32K contexts. The converging signal: exhaustive context is a lazy heuristic, not an optimum.
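The headline figures above are easy to sanity-check. The sketch below recomputes them from the numbers quoted in this digest; the helper names `reduction_pct` and `compression_factor` are ours, introduced only for illustration, and nothing here comes from the papers' own code.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


def compression_factor(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


# IntentKV: KV cache size in tokens, as quoted above.
intentkv_reduction = reduction_pct(92_300, 20_500)
intentkv_factor = compression_factor(92_300, 20_500)

# Engram: accuracy on LongMemEval_S, retrieval vs. full history.
engram_gain = 83.6 - 73.2

print(f"IntentKV reduction: {intentkv_reduction:.1f}%")
print(f"IntentKV compression factor: {intentkv_factor:.1f}x")
print(f"Engram accuracy gain: {engram_gain:.1f} points")
```

Read this way, the IntentKV result is roughly a 4.5x smaller cache, which puts it on the same scale as the "8x fewer tokens" figure quoted for Engram.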

On the evaluation front, ComBench (100 Olympiad-level combinatorics problems) reveals a clear dissociation between proof capability and construction capability in LLMs. Kimi-K2.6 leads on explicit constructions and GPT-4o on formal proofs; the best average score tops out at 65.4% (75.3% Best@4). Separating the two axes is more diagnostically useful than a single aggregate score for identifying where a model structurally fails.
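For readers unfamiliar with the two numbers, here is a minimal sketch of how an average score and a Best@k score can differ. It assumes Best@k means "take the best of k sampled attempts per problem, then average over problems"; the digest does not state ComBench's exact definition. The toy scores and function names are hypothetical.

```python
from statistics import mean


def average_score(scores: list[list[float]]) -> float:
    """Mean over problems of the mean score across attempts."""
    return mean(mean(attempts) for attempts in scores)


def best_at_k(scores: list[list[float]], k: int) -> float:
    """Mean over problems of the best score among the first k attempts."""
    return mean(max(attempts[:k]) for attempts in scores)


# Hypothetical data: 3 problems, 4 attempts each, scores in [0, 1].
toy_scores = [
    [0.0, 1.0, 0.0, 0.0],  # solved once out of four attempts
    [1.0, 1.0, 0.5, 1.0],  # mostly solved
    [0.0, 0.0, 0.0, 0.0],  # never solved
]

avg = average_score(toy_scores)
best4 = best_at_k(toy_scores, 4)
print(f"average: {avg:.3f}, Best@4: {best4:.3f}")
```

Under this assumed definition, Best@k can never fall below the average, so the gap between the two shows how much of a model's capability is reachable only through resampling.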

CodeAlchemy deserves separate attention: 500B+ synthetic tokens generated via five rewriting strategies (including CodeTrace, which instruments 1.3M files to capture real control flow) allow a 3B model to outperform Gemma-3 27B and Granite-4.0 32B on HumanEval (83.5%) and MBPP (63.2%). This is a direct demonstration that synthetic data quality and diversity can outweigh model size on coding tasks, and that data generation pipelines are becoming as strategically important as architecture choices.
