The highest-signal piece today is a comparative study spanning models from 4M to 4B parameters, covered by The Decoder. The finding is counterintuitive: small models fail on rare tasks not because they lack raw capacity, but because training on frequent tasks continuously overwrites what the model has learned for rare ones. The practical implication is immediate: before scaling model size, increase the frequency of the target task in the data mix. This is a data curation lesson, not a scaling one.
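To make the data-mix advice concrete, here is a minimal sketch of raising one task's share of the training samples while keeping the other tasks in their original proportions. It is not the study's procedure: the task names, counts, the 10% target share, and the helper names `upsample_weights` and `sample_tasks` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def upsample_weights(task_counts, target_task, target_share):
    """Return per-task sampling weights that give target_task a fixed share.

    All other tasks split the remaining probability mass in proportion
    to their original example counts.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    other_total = sum(c for t, c in task_counts.items() if t != target_task)
    weights = {}
    for task, count in task_counts.items():
        if task == target_task:
            weights[task] = target_share
        else:
            weights[task] = (1 - target_share) * count / other_total
    return weights


def sample_tasks(weights, n, seed=0):
    """Draw n task labels according to the sampling weights."""
    rng = random.Random(seed)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=n)


# Hypothetical mix: the rare task is 1% of the raw data.
task_counts = {"chat": 9000, "code": 900, "rare_task": 100}

# Assumed target: the rare task should make up 10% of sampled examples.
weights = upsample_weights(task_counts, "rare_task", 0.10)
draws = Counter(sample_tasks(weights, 10_000))
print(weights)
print(draws)
```

In this toy mix the rare task goes from 1% of the raw data to roughly 10% of the sampled batches, without changing the ratio between the two frequent tasks. What share is appropriate for a real training run is an empirical question the sketch does not answer.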
On a separate front, GraphKV (r/LocalLLaMA) proposes KV cache compression via graph embeddings and reports concrete numbers: 7.76x compression on GPT-2 (cosine similarity 0.999949) and 3.36x on Qwen2.5-7B at 32k tokens (cosine similarity 0.990316), using int2/int4/NF4 quantization. These figures are notable for an experimental open-source project. Read alongside the task-frequency study, it points to the same underlying pressure: optimize compute utilization before increasing it.
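For readers unfamiliar with how figures like these are typically produced, the sketch below quantizes a vector to 4-bit codes, restores it, and reports a compression ratio and a cosine similarity. It is a generic symmetric per-group int4 scheme, not GraphKV's graph-embedding method. The group size of 32, the 16-bit source and scale widths, and the random Gaussian stand-in for a KV tensor are all assumptions.

```python
import math
import random


def quantize_int4(values, group_size=32):
    """Symmetric per-group int4 quantization; returns (codes, scales)."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(v / scale))) for v in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=32):
    """Rebuild approximate float values from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compression_ratio(n_values, group_size=32, source_bits=16,
                      code_bits=4, scale_bits=16):
    """Original size divided by compressed size, counting scale overhead."""
    n_groups = math.ceil(n_values / group_size)
    compressed_bits = n_values * code_bits + n_groups * scale_bits
    return (n_values * source_bits) / compressed_bits


# Stand-in for a flattened KV cache slice (assumption: Gaussian values).
rng = random.Random(0)
kv = [rng.gauss(0.0, 1.0) for _ in range(4096)]

codes, scales = quantize_int4(kv)
restored = dequantize_int4(codes, scales)
similarity = cosine_similarity(kv, restored)
ratio = compression_ratio(len(kv))
print(f"cosine similarity: {similarity:.6f}")
print(f"compression ratio: {ratio:.2f}x")
```

Under these assumptions the ratio is 65536 / 18432, about 3.56x, because each group of 32 four-bit codes carries one 16-bit scale. Ratios above 4x from a 16-bit baseline require fewer than 4 bits per value or some further structure in the data, which is presumably where a project's specific method comes in.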
The tokenomics study on agentic software engineering systems (Hacker News) rounds out the picture by quantifying where tokens are actually consumed in autonomous coding workflows, a measurement angle that remains under-instrumented. The mechanistic interpretability experiment on Qwen3.5-35B-A3B (Expert 114, layer 14) is anecdotal at this stage but reflects growing interest in the internal decomposition of MoE routing, specifically the correlation between a routed expert and a first-person self-examination register during generation.
A study comparing models from 4M to 4B parameters finds that small models fail at rare tasks because frequent tasks constantly overwrite the skills learned for them. A practical solution: increase the target task's frequency in the training data rather than scaling up the model.
GraphKV is a KV cache compression project that uses graph embedding models. It reports 7.76x compression on GPT-2 (cosine 0.999949) and 3.36x on Qwen2.5-7B at 32k tokens (cosine 0.990316). Inspired by TurboQuant, it uses int2/int4/NF4 quantization.
open-deepthink adds a knowledge distillation mode using Qualitative Neural Networks (QNN). Agents arranged in layers evolve via Mirror Descent and mutation, generating structured JSON datasets with developmental traces, agent reasoning, and evolutionary history for fine-tuning local LLMs.
Mechanistic interpretability experiment on Qwen3.5-35B-A3B: a routed expert (E114, layer 14) correlates with a first-person self-examination register during generation. The author documents the results ahead of a git release, using a W/S/Q decomposition of MoE routing.
Study quantifying token distribution in agentic AI systems for software engineering. Analyzes where and how tokens are consumed across autonomous agent workflows.
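As an illustration of what this kind of instrumentation can look like, the sketch below tallies prompt and completion tokens per workflow stage and converts the totals into shares. It is not taken from the study: the stage names, the `TokenEvent` record, and the token counts are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TokenEvent:
    """One model call inside an agent run (hypothetical record shape)."""
    stage: str
    prompt_tokens: int
    completion_tokens: int


def tokens_by_stage(events):
    """Sum prompt plus completion tokens for each stage label."""
    totals = defaultdict(int)
    for event in events:
        totals[event.stage] += event.prompt_tokens + event.completion_tokens
    return dict(totals)


def stage_shares(totals):
    """Convert per-stage totals into fractions of the overall token count."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {stage: total / grand_total for stage, total in totals.items()}


# Made-up events for a single agent run.
events = [
    TokenEvent("planning", 1200, 300),
    TokenEvent("code_edit", 4000, 900),
    TokenEvent("test_run", 2500, 100),
    TokenEvent("code_edit", 3500, 500),
]

totals = tokens_by_stage(events)
shares = stage_shares(totals)
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {totals[stage]} tokens ({share:.1%})")
```

Logging one record per model call, tagged with the stage that issued it, is enough to answer the "where do the tokens go" question for a single workflow. Splitting prompt from completion tokens in the report would be a natural next step.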