ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp
In three lines: a llama.cpp PR improves matmul performance for k-quants on the WebGPU backend. Prefill (pp512) speedups measured on M2 Pro: Q2_K 2.44x, Q3_K 3.27-3.78x, Q4_K 1.34-1.36x, Q5_K 1.33x, Q6_K 1.44-1.52x.
## WebGPU + llama.cpp: k-quants finally get proper matmul performance
### What actually changes
PR #24225 by yomaytk in ggml-org/llama.cpp refactors the matmul pipeline for k-quant formats (Q2_K through Q6_K) under the WebGPU backend. All measurements are on pp512 (512-token prefill) on M2 Pro — hardware representative of the high-end local developer segment.
Raw numbers: Q3_K goes from 92.54 t/s to 302.24 t/s on Qwen3.5 4B (3.27x), and from 79.06 t/s to 298.73 t/s on Gemma4 E4B (3.78x). Q2_K on Qwen3 0.6B jumps from 817.86 t/s to 1991.81 t/s (2.44x). Gains on Q4_K, Q5_K, and Q6_K are more modest but consistent: between 1.33x and 1.52x depending on the format and model.
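The speedup factors are plain ratios of the after and before throughput. A minimal Python check, using only the t/s figures quoted above (the dictionary layout and function name are illustrative, not taken from the PR):

```python
# Prefill throughput in tokens/s as (before, after), copied from the figures quoted above.
results = {
    ("Q3_K", "Qwen3.5 4B"): (92.54, 302.24),
    ("Q3_K", "Gemma4 E4B"): (79.06, 298.73),
    ("Q2_K", "Qwen3 0.6B"): (817.86, 1991.81),
}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after/before."""
    return after / before


# Round to two decimals to match how the factors are reported.
speedups = {key: round(speedup(*pair), 2) for key, pair in results.items()}

for (quant, model), factor in speedups.items():
    print(f"{quant} on {model}: {factor:.2f}x")
```

Note that a factor of 3.27x means the new path is 3.27 times as fast, not 3.27 times faster on top of the baseline.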
### Why k-quants were underperforming on WebGPU
K-quants (the block-wise compression format with super-blocks introduced by llama.cpp) have a more complex decode structure than simple Q4_0 or Q8_0 quantizations. Each block requires reconstructing scales and min-values from compressed bits before the matrix multiply can proceed. On Metal and CUDA, dedicated kernels have handled this efficiently for a long time. On WebGPU, the ggml-webgpu backend was using generic fallback paths — hence the massive gap on Q3_K in particular.
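To make the extra decode work concrete, here is a deliberately simplified Python sketch of dequantizing one super-block in a scale-and-min style format. It assumes the per-sub-block scales and mins have already been unpacked into plain integer lists; the real formats pack them into bit fields, and not every k-quant type uses this exact layout. The function name, constants, and argument layout are hypothetical and are not ggml's API or the PR's WGSL code.

```python
SUPER_BLOCK = 256  # weights per super-block (assumed for this sketch)
SUB_BLOCK = 32     # weights per sub-block (assumed for this sketch)


def dequantize_super_block(d, dmin, scales, mins, q):
    """Reconstruct float weights from one simplified super-block.

    d, dmin : super-block level float factors for scales and mins
    scales  : one small integer scale per sub-block (already unpacked)
    mins    : one small integer min per sub-block (already unpacked)
    q       : the quantized integer values, one per weight
    """
    n_sub = SUPER_BLOCK // SUB_BLOCK
    assert len(q) == SUPER_BLOCK
    assert len(scales) == n_sub and len(mins) == n_sub

    out = []
    for j in range(n_sub):
        # Two-level reconstruction: the super-block factor times the sub-block value.
        scale = d * scales[j]
        offset = dmin * mins[j]
        for i in range(SUB_BLOCK):
            out.append(scale * q[j * SUB_BLOCK + i] - offset)
    return out


# Toy input: every weight decodes to 0.5 * 2 * 3 - 0.25 * 4 = 2.0
weights = dequantize_super_block(0.5, 0.25, [2] * 8, [4] * 8, [3] * 256)
```

A simple format like Q8_0 needs only one scale per block, so the inner loop above is the whole story there. The two-level scale and offset reconstruction (plus the bit unpacking omitted here) is the part a matmul kernel has to fold into its inner loop for k-quants.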
The PR introduces specialized WGSL shaders for each k-quant format, with a matmul dispatch refactor that adapts workgroup sizes to the block structure. The asymmetric gains (3.78x on Q3_K vs 1.34x on Q4_K) are explained by the fact that Q3_K previously had the least optimized path, while Q4_K already benefited partially from earlier optimizations.
### Context: why WebGPU matters now
WebGPU is no longer an experimental backend. Since late 2024, it is the only viable path for running quantized LLMs directly in the browser with GPU acceleration on Windows (no Metal, no CUDA in the browser). It also covers Intel Arc GPUs, AMD iGPUs, and configurations without CUDA drivers — a non-trivial fraction of developer and end-user machines.
Before this PR, using Q3_K via WebGPU on a 4B model yielded ~79-92 t/s prefill on M2 Pro. Acceptable for short prompts, but painful for long contexts or batch processing. At ~300 t/s, and assuming the pp512 rate holds at longer lengths, prefilling a 4096-token context drops from roughly 44-52 s to about 14 s: a perceptible difference in any interactive workflow.
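The prefill-time estimate is a straight division of context length by throughput. The sketch below reproduces it for the Qwen3.5 4B Q3_K figures, under the optimistic assumption that the rate measured at pp512 holds unchanged at 4096 tokens (in practice attention cost grows with context, so real times would likely be somewhat higher). The helper name is illustrative.

```python
def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Estimated prefill time, assuming a constant throughput."""
    return n_tokens / tokens_per_second


context = 4096  # tokens to prefill

# Q3_K on Qwen3.5 4B, throughput figures quoted earlier in this article.
before = prefill_seconds(context, 92.54)
after = prefill_seconds(context, 302.24)

print(f"before: {before:.1f} s, after: {after:.1f} s, saved: {before - after:.1f} s")
```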
### Who loses in this equation
Competing browser-side backends: **WebLLM** (Apache TVM/MLC-based) and **Transformers.js** (ONNX Runtime Web) previously held a performance edge on k-quants through their own compilation pipelines. This PR narrows that gap and reinforces llama.cpp as the single reference covering CPU, Metal, CUDA, Vulkan, and WebGPU within one codebase.
On the hardware side, GPUs without fast native integer operation support (some older mobile GPUs, WebGPU on CPU fallback) will not see the same gains — the optimized WGSL shaders assume real GPU dispatch.
### Limitations and what's still missing
Benchmarks are exclusively on M2 Pro. Performance on discrete GPUs (NVIDIA via WebGPU/Dawn, AMD RDNA) is not documented in the PR — gains could vary significantly depending on shader compute unit architecture. The test covers pp512 (prefill) but not tg128 (generation): decode speedups are likely smaller since that regime is memory-bandwidth-bound rather than compute-bound.
Q8_0 and non-k-quant formats (Q4_0, Q5_0) are mentioned in the PR title as refactored, but published benchmarks only cover k-quants — gains on those formats remain independently unverified.
### Signal for practitioners
If you deploy llama.cpp via WebGPU (wasm builds, Electron apps, or Node.js servers with GPU backend), this PR warrants immediate testing once merged. Prioritize Q3_K on 4B-7B models if prefill latency is your primary constraint: the quality/speed ratio becomes substantially more favorable. For sub-1B models (Qwen3 0.6B), Q2_K at ~2000 t/s prefill opens real-time use cases that were not practical before.
Summary generated by Claude — human-verified