Mudler today released APEX GGUF quantizations of Qwen3.6-35B-A3B distilled on Claude-4.7-Opus, with a multi-token prediction (MTP) head baked directly into the file. The practical upside: self-speculative decoding works in llama.cpp with no separate draft model to manage. Overhead is minimal: the files are 2.5% larger than the non-MTP versions, and the MTP head is quantized at Q8_0. This kind of packaging lowers adoption friction for anyone who wants local speculative decoding without extra orchestration.
On the Apple Silicon side, two signals converge: the launch of mlx-Chronos (an open-source CLI and community leaderboard measuring TTFT, throughput, RAM, and thermal state across oMLX, Rapid-MLX, mlx-lm, and Ollama) and a concrete benchmark on an M1 Max 64GB with Qwen 3.5-4B that puts Rapid-MLX ahead on both speed and memory efficiency. The leaderboard is still sparse, with only M2 8GB results so far, but the standardized methodology is in place, which is exactly what was missing for a serious comparison of MLX inference engines.
The most structurally significant piece today comes from Harbin Institute of Technology's LiveBrowseComp study: GPT-5.4 and Kimi K2.6, tested on events from the past 90 days, mostly confirm their training knowledge rather than actually browsing the web. Block access to training memory and performance collapses. This isn't a prompt engineering bug — it's an architectural issue with how search agents are built, and it matters before deploying these systems on any use case that genuinely requires up-to-date information.
Mudler releases APEX GGUF quantizations of Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled with a bundled MTP (multi-token prediction) head. The files enable self-speculative decoding via llama.cpp without a separate draft model. File size is 2.5% larger than the non-MTP version, and the MTP head is quantized at Q8_0 for high draft accuracy.
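To illustrate why draft accuracy matters here, the draft-and-verify loop behind greedy speculative decoding can be sketched in Python. This is a toy under stated assumptions, not llama.cpp's implementation and not a description of how the MTP head works internally: `target` and `draft` are hypothetical callables that map a token context to the next token, and the window size `k` is an arbitrary choice. A real engine verifies all drafted tokens in one batched forward pass; the sketch calls the target once per token for readability.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given the tokens so far, return the next token id.
NextToken = Callable[[Sequence[int]], int]


def speculative_greedy(
    target: NextToken,
    draft: NextToken,
    prompt: Sequence[int],
    max_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding.

    Returns the generated tokens and the number of drafted tokens the
    target accepted. The output always equals plain greedy decoding with
    the target; the draft only changes how much work could be batched.
    """
    out = list(prompt)
    goal = len(prompt) + max_new
    accepted = 0
    while len(out) < goal:
        # Draft phase: the cheap predictor guesses k tokens ahead.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(out + guesses))
        # Verify phase: keep guesses while they match the target's choice.
        for guess in guesses:
            if len(out) >= goal:
                break
            token = target(out)
            out.append(token)  # the target's token is always the one kept
            if token != guess:
                break  # first mismatch: discard the remaining guesses
            accepted += 1
        else:
            # Every guess matched, so the target adds one bonus token.
            if len(out) < goal:
                out.append(target(out))
    return out[len(prompt):], accepted


if __name__ == "__main__":
    # Toy models: the target counts upward mod 10; the draft agrees
    # except right after token 2, where it guesses wrong.
    toy_target = lambda ctx: (ctx[-1] + 1) % 10
    toy_draft = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10
    print(speculative_greedy(toy_target, toy_draft, [0], max_new=8))
```

The more often the draft agrees with the target, the more tokens each verification round yields, which is the usual argument for keeping a draft head at a relatively high-precision quantization.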
mlx-Chronos is an open-source CLI tool and community leaderboard for benchmarking local LLM inference engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama). It measures TTFT, throughput, RAM, and thermal state with a standardized methodology. The leaderboard is currently populated only with M2 8GB results.
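For readers unfamiliar with the two timing metrics, here is a minimal Python sketch of how TTFT and decode throughput can be derived from a token stream. It is not mlx-Chronos's code or methodology: `measure_stream` is a hypothetical helper, and counting throughput only over the tokens after the first one is an assumption (some tools include the prefill time instead).

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a token stream and report TTFT and decode throughput.

    TTFT is the delay from the start of the call to the first token.
    Throughput (an assumed definition) is tokens per second over the
    interval between the first and the last token.
    """
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        # Empty stream: nothing to measure.
        return {"tokens": 0, "ttft_s": None, "decode_tok_per_s": None}
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tok_per_s": rate}


def fake_engine(n: int, prefill_s: float, per_token_s: float):
    """Stand-in generator for an inference engine's streaming output."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_engine(20, prefill_s=0.05, per_token_s=0.01)))
```

In practice the iterable would be whatever streaming interface the engine exposes; the injectable `clock` parameter exists only so the timing logic can be checked deterministically.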
AI search agents like GPT-5.4 and Kimi K2.6 mostly confirm their training knowledge rather than genuinely researching the web. Researchers at Harbin Institute of Technology demonstrated this using LiveBrowseComp, a benchmark based on events from the last 90 days. When the agents cannot rely on training memory, performance collapses.
Benchmark of inference engines on an M1 Max 64GB comparing Rapid-MLX, oMLX, mlx-lm, and Ollama with Qwen 3.5-4B. Rapid-MLX leads on speed and memory efficiency. The results were submitted to the mlx-Chronos community leaderboard.
Komi-learn is a framework for coding agents with continuous memory and self-improvement capabilities. The project enables agents to learn from past experiences and improve performance over time.