Edition of 2026-06-07

LLM emergent capabilities explained by task frequency, not scale — and that reshapes training strategy

The highest-signal piece today comes from a comparative study spanning models from 4M to 4B parameters, covered by The Decoder. The finding is counterintuitive: small models don't fail on rare tasks because they lack raw capacity, but because updates driven by frequent tasks continually overwrite the gradient signal from rare ones during training. The practical implication is immediate: before scaling model size, increase the frequency of the target task in the data mix. This is a data curation lesson, not a scaling one.
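As a rough illustration of what "increase the frequency of the target task in the data mix" can mean in practice, here is a minimal sketch that raises one task's sampling share while keeping the relative proportions of all other tasks. The function name `upweight_task`, the task names, and every number below are hypothetical; none of them come from the study.

```python
def upweight_task(mix, task, target_share):
    """Return a new sampling mix in which `task` has probability `target_share`.

    `mix` maps task name -> sampling weight (it need not sum to 1).
    All other tasks keep their proportions relative to each other.
    """
    if task not in mix:
        raise KeyError(task)
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")

    # Total weight of everything except the task being upweighted.
    rest = sum(weight for name, weight in mix.items() if name != task)
    if rest <= 0:
        raise ValueError("need at least one other task with positive weight")

    # Rescale the other tasks so they share the remaining probability mass.
    scale = (1.0 - target_share) / rest
    return {
        name: (target_share if name == task else weight * scale)
        for name, weight in mix.items()
    }


# Hypothetical data mix: the rare task is 0.1% of sampled examples.
mix = {"web_text": 0.90, "code": 0.099, "rare_task": 0.001}

# Raise the rare task to 5% of sampled examples (an arbitrary target).
new_mix = upweight_task(mix, "rare_task", 0.05)
```

The right target share is an empirical question this sketch does not answer. It only shows the bookkeeping: the other tasks shrink proportionally so the mix still sums to 1.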

On a separate front, GraphKV (r/LocalLLaMA) proposes KV cache compression via graph embeddings with concrete numbers: 7.76x on GPT-2 (cosine similarity 0.999949) and 3.36x on Qwen2.5-7B at 32k tokens (cosine 0.990316), using int2/int4/NF4 quantization. These figures are notable for an experimental open-source project. Read alongside the task-frequency study, both articles converge on the same underlying pressure: optimize compute utilization before increasing it.
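To make the two reported metrics concrete (compression ratio and cosine similarity), the sketch below applies plain symmetric 4-bit uniform quantization to a random vector and measures both. This is not GraphKV's method: it involves no graph embeddings and no NF4, and the 16-bit baseline, the single 16-bit scale per vector, and the random test data are all assumptions made for illustration.

```python
import math
import random


def quantize_int4(values):
    """Symmetric uniform 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compression_ratio(n_values, orig_bits=16, code_bits=4, scale_bits=16):
    """Original size divided by compressed size, counting the stored scale."""
    return (n_values * orig_bits) / (n_values * code_bits + scale_bits)


# Stand-in for one cached key/value vector (random data, not real model state).
rng = random.Random(0)
vector = [rng.gauss(0.0, 1.0) for _ in range(128)]

codes, scale = quantize_int4(vector)
restored = dequantize(codes, scale)

fidelity = cosine_similarity(vector, restored)  # closer to 1.0 means less distortion
ratio = compression_ratio(len(vector))          # 2048 / 528 bits under the stated assumptions
```

Under these assumptions the ratio is bounded just below 4x by the bit widths alone, so a figure like 7.76x implies something beyond simple 4-bit storage, such as lower bit widths or the structural compression the project describes.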

The tokenomics study on agentic software engineering systems (Hacker News) rounds out the picture by quantifying where tokens are actually consumed in autonomous coding workflows, a measurement angle that remains underinstrumented. The mechanistic interpretability experiment on Qwen3.5-35B-A3B (Expert 114, layer 14) is anecdotal at this stage, but it reflects growing interest in decomposing MoE routing internally. Specifically, it reports a correlation between one routed expert and a first-person, self-examining register during generation.

LLM emergent capabilities explained by task frequency, not scale — and that reshapes training strategy · Signal IA