Edition of 2026-05-31

MTP baked into GGUF, Apple Silicon inference finally benchmarked properly, and search agents that mostly confirm what they already know.

Mudler today released APEX GGUF quantizations of Qwen3.6-35B-A3B distilled on Claude-4.7-Opus, with a multi-token prediction (MTP) head baked directly into the file. The practical upside: self-contained speculative decoding works in llama.cpp with no separate draft model to manage. Overhead is minimal: +2.5% file size, with the MTP head quantized at Q8_0. This kind of packaging lowers the barrier for anyone who wants local speculative decoding without extra orchestration.
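
For readers new to the idea, here is a minimal sketch of greedy speculative decoding, assuming two hypothetical callables: `draft_next` stands in for the cheap predictor (the role the MTP head plays here) and `target_next` for the full model. It is a toy illustration of the draft-then-verify loop, not llama.cpp's implementation.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical: context -> next token


def speculative_step(context: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the target model agrees with."""
    # Draft phase: the cheap predictor proposes k tokens in a row.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # Verify phase: a real engine checks all drafts in one batched forward
    # pass; this toy version calls the target once per position instead.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)

    # Every draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken,
             target_next: NextToken, max_new_tokens: int, k: int = 4) -> List[Token]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone. The draft only changes how many target calls can be batched together.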

On the Apple Silicon side, two signals converge: the launch of mlx-Chronos (an open-source CLI and community leaderboard measuring TTFT, throughput, RAM, and thermal state across oMLX, Rapid-MLX, mlx-lm, and Ollama) and a concrete benchmark on M1 Max 64GB with Qwen 3.5-4B that puts Rapid-MLX ahead on both speed and memory efficiency. The leaderboard is still sparse, with only M2 8GB results so far, but the standardized methodology is in place, which is exactly what was missing to seriously compare MLX inference engines against each other.
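
To make two of those metrics concrete, here is a minimal sketch of how TTFT and decode throughput can be derived from any token stream. The `measure_stream` helper and its `clock` parameter are hypothetical names for illustration. This is not mlx-Chronos's methodology, and it leaves out RAM and thermal state.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Dict[str, Optional[float]]:
    """Time a token stream: time to first token (TTFT) and decode tokens/sec."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival of the first token defines TTFT
        last = now
        count += 1

    if first is None:
        return {"tokens": 0, "ttft_s": None, "decode_tok_per_s": None}

    # Throughput is measured over the decode phase only, i.e. after the first
    # token, so prompt processing time does not skew the tokens/sec figure.
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tok_per_s": rate}
```

Wrapping each engine's streaming generator in the same helper, with the same prompt and model, is the kind of like-for-like setup a shared methodology is meant to enforce.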

The most structurally significant piece today comes from Harbin Institute of Technology's LiveBrowseComp study: GPT-5.4 and Kimi K2.6, tested on events from the past 90 days, mostly confirm their training knowledge rather than actually browsing the web. Block access to training memory and performance collapses. This isn't a prompt engineering bug — it's an architectural issue with how search agents are built, and it matters before deploying these systems on any use case that genuinely requires up-to-date information.

Today's 5 picks