Three of today's five papers attack the same problem from different angles: shrinking the active context without degrading accuracy. IntentKV cuts the peak request size of multi-turn agents from 92.3k to 20.5k tokens (−77.8%) on Qwen2.5-14B by scoring historical tokens against a cross-turn intent memory. Engram shows that a smaller context can also be a better one: retrieving ~9.6k tokens through a bi-temporal knowledge graph, it reaches 83.6% on LongMemEval_S versus 73.2% for the full history, that is, 8x fewer tokens and +10.4 accuracy points. Prefilling-dLLM applies the same logic to diffusion language models, achieving a 9.1–28.0x speedup on 8K–32K contexts. The converging signal: exhaustive context is a lazy heuristic, not an optimum.
On the evaluation front, ComBench (100 Olympiad-level combinatorics problems) reveals a clear dissociation between proof capability and construction capability in LLMs. Kimi-K2.6 leads on explicit constructions, GPT-4o on formal proofs; the best average score caps at 65.4% (75.3% Best@4). This dual-axis benchmark design is more diagnostically useful than aggregate scores for identifying where a model structurally fails.
CodeAlchemy deserves separate attention: 500B+ synthetic tokens generated via five rewriting strategies, including CodeTrace, which instruments 1.3M+ files to capture real control flow, allow a 3B model to outperform Gemma-3 27B and Granite-4.0 32B on HumanEval (83.5%) and MBPP (63.2%). On these two coding benchmarks, synthetic data quality and diversity outweigh a roughly 10x gap in model size, which suggests that data generation pipelines are becoming as strategically important as architecture choices.
ComBench is a benchmark of 100 Olympiad-level combinatorics problems for evaluating LLM mathematical reasoning. It separates analysis-centric problems (rigorous proofs) from construction-centric problems (explicit constructions). Top models reach a 65.4% average score and 75.3% Best@4. Kimi-K2.6 outperforms GPT-4o on constructions but trails on proof grading.
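The gap between the average score and Best@4 is easier to read with a small sketch. It assumes that "average" is the mean over all sampled attempts and that Best@k takes the highest-scoring of k attempts per problem; the digest does not spell out ComBench's exact scoring rules, so both definitions and the toy scores below are illustrative only.

```python
from statistics import mean

def average_score(scores_per_problem):
    """Mean over every attempt on every problem (assumed definition)."""
    return mean(s for attempts in scores_per_problem for s in attempts)

def best_at_k(scores_per_problem, k):
    """Mean over problems of the best score among the first k attempts
    (assumed definition of Best@k)."""
    return mean(max(attempts[:k]) for attempts in scores_per_problem)

# Made-up data: 3 problems, 4 attempts each, scores in [0, 1].
toy_scores = [
    [1.0, 0.0, 1.0, 1.0],  # usually solved
    [0.0, 0.0, 0.5, 0.0],  # partial credit on one attempt
    [0.0, 0.0, 0.0, 0.0],  # never solved
]

avg = average_score(toy_scores)
best4 = best_at_k(toy_scores, 4)
```

Under these assumptions Best@k can never fall below the average, and the size of the gap indicates how inconsistent a model is across attempts on the same problem.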
CodeAlchemy generates 500B+ synthetic tokens from public code across 15 languages using five strategies: CodeEnhance, CodeQA, CodeDev, CodeDialogue, and CodeTrace. CodeTrace instruments 1.3M+ files to capture control flow and library knowledge. The resulting 3B models outperform models roughly 10x larger (Gemma-3 27B, Granite-4.0 32B), scoring 83.5% on HumanEval and 63.2% on MBPP.
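To give a feel for what "instrumenting a file to capture control flow" can mean, here is a minimal sketch using Python's standard `sys.settrace` hook. It is not CodeAlchemy's pipeline, whose tooling is not described here; the helper `trace_lines` and the sample function `classify` are hypothetical names chosen for illustration.

```python
import sys

def trace_lines(func, *args):
    """Run func(*args) and record which of its lines execute,
    as offsets from the function's `def` line."""
    executed = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the traced function.
        if frame.f_code is code and event == "line":
            executed.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always remove the hook
    return result, executed

def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"

negative_result, negative_path = trace_lines(classify, -1)
positive_result, positive_path = trace_lines(classify, 3)
```

The two calls take different branches, so the recorded line offsets differ. Pairing source code with execution paths like these is one plausible way a trace could become training data.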
IntentKV is a KV cache pruning technique for multi-turn LLM agents. It maintains a cross-turn intent memory and uses memory-attention rules to score historical tokens. On Qwen2.5-14B with an 8k budget, it reduces peak request tokens from 92.3k to 20.5k (−77.8%) and KV reads from 411M to 31M (−92.6%) with minimal accuracy loss.
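The general idea of scoring historical tokens against a running intent memory and keeping only a fixed budget can be sketched as follows. This is a simplified illustration, not IntentKV's actual memory-attention rules: the dot-product score, the moving-average update, and the names `prune_history` and `update_intent` are all assumptions.

```python
def prune_history(token_keys, intent_memory, budget):
    """Return indices of the `budget` highest-scoring tokens, in original order.
    Score = dot product between a token's key vector and the intent memory."""
    scores = [sum(k * m for k, m in zip(key, intent_memory)) for key in token_keys]
    ranked = sorted(range(len(token_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def update_intent(intent_memory, turn_vector, decay=0.5):
    """Blend the previous intent with the current turn (assumed moving average)."""
    return [decay * m + (1 - decay) * t for m, t in zip(intent_memory, turn_vector)]

# Toy 2-dimensional key vectors for four historical tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

intent = [1.0, 0.0]
kept_before = prune_history(keys, intent, budget=2)

# The agent's intent shifts toward the second dimension in a later turn.
intent = update_intent(intent, [0.0, 1.0], decay=0.25)
kept_after = prune_history(keys, intent, budget=2)
```

In this toy run the retained tokens change from indices 0 and 2 to indices 1 and 3 once the intent shifts, which is the behavior a cross-turn memory is meant to provide: what is kept follows what the agent is currently trying to do.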
Engram, an open-source dual-process memory engine for LLM agents, uses a bi-temporal knowledge graph to outperform full-context baselines. On LongMemEval_S (500 questions), the lean configuration retrieves ~9.6k tokens and achieves 83.6% vs 73.2% for full history (+10.4 points, p<10^-6), using 8x fewer tokens.
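"Bi-temporal" means each fact carries two timelines: when it was true in the world, and when the system believed it. The sketch below illustrates that generic idea; it does not reproduce Engram's schema, and the `Fact` fields and `query` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int                     # when the fact became true in the world
    valid_to: Optional[int]             # None = still true
    recorded_at: int                    # when the system learned it
    retracted_at: Optional[int] = None  # when the system stopped believing it

def query(facts, subject, relation, valid_at, known_at):
    """Objects true at `valid_at`, according to what was believed at `known_at`."""
    return [
        f.obj for f in facts
        if f.subject == subject and f.relation == relation
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
        and f.recorded_at <= known_at
        and (f.retracted_at is None or known_at < f.retracted_at)
    ]

# The user moved from Paris to Berlin at t=10; the system only learned this at t=12.
facts = [
    Fact("user", "lives_in", "Paris", 1, None, recorded_at=2, retracted_at=12),
    Fact("user", "lives_in", "Paris", 1, 10, recorded_at=12),
    Fact("user", "lives_in", "Berlin", 10, None, recorded_at=12),
]

stale_belief = query(facts, "user", "lives_in", valid_at=11, known_at=11)
current_belief = query(facts, "user", "lives_in", valid_at=11, known_at=12)
past_residence = query(facts, "user", "lives_in", valid_at=5, known_at=12)
```

Keeping both timelines lets a retriever return only the facts that are valid for the question at hand, rather than every mention in the history, which is one plausible reason a small retrieved context can beat the full transcript.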
Prefilling-dLLM optimizes diffusion language model inference by partitioning context into chunks, caching their KV representations, and selecting relevant chunks with intra-chunk token sparsity. It achieves a 9.1–28.0x speedup on 8K–32K contexts without full prefix re-encoding.
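The two-level selection described above (pick relevant chunks, then keep only some tokens inside each) can be sketched as below. This is an illustration under stated assumptions, not the paper's method: plain token vectors stand in for cached KV entries, relevance is a dot product with a query vector, and the helper names are hypothetical.

```python
def partition(items, chunk_size):
    """Split a sequence into consecutive fixed-size chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select(chunks, query, top_chunks, tokens_per_chunk):
    """Return (chunk_index, kept_token_indices) pairs.
    Chunk score = best token score in the chunk (an assumed heuristic)."""
    chunk_scores = [max(dot(v, query) for v in chunk) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda c: chunk_scores[c], reverse=True)
    selection = []
    for c in sorted(ranked[:top_chunks]):
        token_rank = sorted(range(len(chunks[c])),
                            key=lambda t: dot(chunks[c][t], query), reverse=True)
        selection.append((c, sorted(token_rank[:tokens_per_chunk])))
    return selection

# Six toy token vectors, cached once as three chunks of two tokens.
vectors = [[1.0, 0.0], [0.8, 0.0], [0.0, 1.0], [0.0, 0.9], [0.5, 0.5], [0.2, 0.1]]
cached_chunks = partition(vectors, 2)

chosen = select(cached_chunks, query=[1.0, 0.0], top_chunks=2, tokens_per_chunk=1)
```

Because the chunks are cached once and only a subset is read per step, the cost of each step depends on the selected tokens rather than on the full prefix, which is the intuition behind avoiding full prefix re-encoding.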