Edition of 2026-06-10

Context compression and reasoning evaluation: two structural research axes on June 10

By the editorial team

Today's 5 picks

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench is a benchmark of 100 Olympiad-level combinatorics problems to evaluate LLM mathematical reasoning. It distinguishes analysis-centric problems (rigorous proofs) from construction-centric problems (explicit constructions). Top models reach 65.4% average and 75.3% Best@4. Kimi-K2.6 outperforms GPT-4o on constructions but trails on proof grading.
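
The two headline numbers measure different things, and a small sketch may make the gap easier to read. It assumes Best@4 means "score of the best of four sampled attempts per problem, averaged over problems"; the summary does not define the metric, so that reading and the function names below are illustrative only.

```python
def average_score(attempts_per_problem):
    """Mean score over every attempt on every problem."""
    flat = [s for attempts in attempts_per_problem for s in attempts]
    return sum(flat) / len(flat)


def best_at_k(attempts_per_problem, k):
    """Mean over problems of the best score among the first k attempts.

    Assumption: this is one common reading of "Best@k", not a definition
    taken from the ComBench paper.
    """
    best = [max(attempts[:k]) for attempts in attempts_per_problem]
    return sum(best) / len(best)


# Toy data: two problems, four graded attempts each (scores in [0, 1]).
toy = [
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
print(average_score(toy))
print(best_at_k(toy, 4))
```

Under this reading Best@4 can never fall below the per-attempt average, which is consistent with the 75.3% vs 65.4% ordering reported above.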

Benchmarks Reasoning Evals

arXiv cs.CL·SIG 82

CodeAlchemy: Synthetic Code Rewriting at Scale

CodeAlchemy generates 500B+ synthetic tokens via 5 strategies (CodeEnhance, CodeQA, CodeDev, CodeDialogue, CodeTrace) from public code across 15 languages. CodeTrace instruments 1.3M+ files to capture control flow and library knowledge. 3B models outperform models roughly 10x larger (Gemma-3 27B, Granite-4.0 32B): 83.5% on HumanEval, 63.2% on MBPP.

Code generation Benchmarks Fine-tuning

arXiv cs.LG·SIG 82

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

IntentKV is a KV cache pruning technique for multi-turn LLM agents. It maintains cross-turn intent memory and uses memory-attention rules to score historical tokens. On Qwen2.5-14B with an 8k budget, it reduces peak request tokens from 92.3k to 20.5k (−77.8%) and KV reads from 411M to 31M (−92.6%) with minimal accuracy loss.
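
As a rough illustration of budgeted pruning, the sketch below keeps the highest-scoring historical tokens up to a fixed budget and preserves their original order. The per-token scores are taken as given: how IntentKV derives them from its intent memory is not described in this summary, and `prune_to_budget` and `relative_reduction` are hypothetical helper names, not the paper's API.

```python
def prune_to_budget(scores, budget):
    """Return the positions of the tokens to keep, in original order.

    scores: one importance score per historical token (assumed given).
    budget: maximum number of tokens to retain.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if budget >= len(scores):
        return list(range(len(scores)))
    # Rank positions by score, highest first, then restore sequence order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


print(prune_to_budget([0.1, 0.9, 0.3, 0.8], budget=2))
print(round(100 * relative_reduction(92.3, 20.5), 1))
```

The second call reproduces the reported peak-token reduction from the rounded figures quoted above.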

AI Agents Reasoning Infrastructure

arXiv cs.CL·SIG 82

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Engram, an open-source dual-process memory engine for LLM agents, uses a bi-temporal knowledge graph to outperform full-context baselines. On LongMemEval_S (500 questions), the lean configuration retrieves ~9.6k tokens and achieves 83.6% vs 73.2% for full history (+10.4 points, p<10^-6), using 8x fewer tokens.
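
For readers unfamiliar with the term, "bi-temporal" in the usual database sense means each fact carries two timelines: when it was true in the world and when the system learned it. The sketch below shows that general idea only; the `Fact` fields and `facts_as_of` are assumptions for illustration and are not Engram's schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int           # when the fact became true in the world
    valid_to: Optional[int]   # when it stopped being true (None = still true)
    recorded_at: int          # when the agent learned it


def facts_as_of(facts, valid_time, known_time):
    """Facts true at `valid_time`, using only what was known by `known_time`."""
    return [
        f for f in facts
        if f.recorded_at <= known_time
        and f.valid_from <= valid_time
        and (f.valid_to is None or valid_time < f.valid_to)
    ]


paris = Fact("user", "lives_in", "Paris", valid_from=1, valid_to=5, recorded_at=1)
berlin = Fact("user", "lives_in", "Berlin", valid_from=5, valid_to=None, recorded_at=6)
memory = [paris, berlin]

print(facts_as_of(memory, valid_time=3, known_time=6))  # the Paris fact
print(facts_as_of(memory, valid_time=7, known_time=6))  # the Berlin fact
print(facts_as_of(memory, valid_time=7, known_time=3))  # nothing known yet
```

Filtering on both timelines is one way a retriever can return a small, current slice of memory instead of the full history.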

AI Agents Reasoning Benchmarks

arXiv cs.CL·SIG 78

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Prefilling-dLLM optimizes diffusion language model inference by partitioning context into chunks, caching their KV representations, and selecting relevant chunks with intra-chunk token sparsity. It achieves a 9.1–28.0x speedup on 8K–32K contexts without re-encoding the full prefix.
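
The partition-then-select pattern can be sketched in a few lines. This is a toy under stated assumptions: fixed-size chunks and a lexical-overlap relevance score stand in for whatever chunking and selection criterion the paper uses, and neither KV caching nor intra-chunk sparsity is modeled. Function names are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Partition a token list into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def select_chunks(chunks, query_tokens, top_k):
    """Return indices of the top_k chunks most relevant to the query.

    Relevance here is plain token overlap, a stand-in chosen for
    illustration, not the paper's scoring rule.
    """
    query = set(query_tokens)
    overlap = [len(query & set(chunk)) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: overlap[i], reverse=True)
    return sorted(ranked[:top_k])


context = "a b c d e f g h".split()
chunks = split_into_chunks(context, chunk_size=3)
print(chunks)
print(select_chunks(chunks, ["e", "h", "f"], top_k=2))
```

In a real system each chunk's KV representation would be computed once and reused, so only the selected chunks are attended to at generation time.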

Reasoning Benchmarks Infrastructure