Reddit r/LocalLLaMA

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

In three lines: llama.cpp b9455 merges a major fix for KV cache quantization in tensor mode on multi-GPU. The solution extends the meta backend to properly handle tensor flattening without losing shape information, avoiding changes to compute graphs.

## llama.cpp b9455: Quantized KV Cache Finally Works with --sm tensor on Multi-GPU

### The Precise Technical Problem

Prior to b9455, combining `--split-mode tensor` with KV cache quantization on multi-GPU setups produced incorrect outputs or silent failures. Root cause: during KV cache rotation (RoPE operation), llama.cpp flattens tensors for performance reasons. That flattening destroys the shape metadata the meta backend relies on to orchestrate compute distribution across GPUs. The meta backend could no longer reconstruct the original tensor topology after a reshape, making the two features mutually exclusive in practice.

### Why Previous Fixes Were Inadequate

Earlier PRs attempted a downstream fix: modify the KV cache rotation shapes to avoid the problematic flattening. JohannesGaessler explicitly rejects this approach for a critical performance reason — batched matrix multiplications are less well-supported across ggml backends than a single large matrix multiplication. Forcing the compute graph to conform to the meta backend's limitations is technical debt: it treats the symptom while potentially degrading performance on specific backends (Metal, CUDA, Vulkan).

### The Chosen Solution: Extend the Meta Backend

The PR extends the `ggml_backend_meta_split_state` specification with a new field encoding how many times a given segment repeats. In practice: when a tensor is flattened, the meta backend now encodes the repetition structure within its memory layout descriptor segments. On a subsequent reshape, it can reconstruct the correct layout without requiring the compute graph to explicitly carry the lost shape information. No changes to llama.cpp compute graphs are needed — the fix is entirely contained within the ggml abstraction layer.
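To make the idea of a repeat count concrete, here is a toy model in Python. It is only a sketch: `ToySplitState`, `flatten`, `owner`, and `unflatten` are hypothetical names invented for illustration and do not mirror the actual fields of `ggml_backend_meta_split_state`. It assumes a row-major 2D tensor whose columns are split across devices; after flattening to 1D, each device's data appears as a short segment that repeats once per row, and storing that repeat count is enough to recover the original layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToySplitState:
    """Hypothetical descriptor for illustration; not ggml's actual struct."""
    segments: Tuple[int, ...]  # elements owned by device 0, 1, ... within one period
    repeat: int                # how many times that period repeats in the flat buffer


def flatten(cols_per_device: Tuple[int, ...], n_rows: int) -> ToySplitState:
    """Describe a row-major (n_rows, n_cols) tensor, split by columns, once flattened to 1D."""
    return ToySplitState(segments=tuple(cols_per_device), repeat=n_rows)


def owner(state: ToySplitState, flat_index: int) -> int:
    """Return the device that holds element `flat_index` of the flattened tensor."""
    period = sum(state.segments)
    if not 0 <= flat_index < period * state.repeat:
        raise IndexError(flat_index)
    offset = flat_index % period  # position inside the repeating pattern
    for device, size in enumerate(state.segments):
        if offset < size:
            return device
        offset -= size
    raise AssertionError("unreachable")


def unflatten(state: ToySplitState) -> Tuple[int, int, Tuple[int, ...]]:
    """Recover (n_rows, n_cols, cols_per_device) from the descriptor alone."""
    return state.repeat, sum(state.segments), state.segments


state = flatten(cols_per_device=(3, 5), n_rows=4)
print(owner(state, 2), owner(state, 3), owner(state, 8))  # 0 1 0
print(unflatten(state))  # (4, 8, (3, 5))
```

Without the `repeat` field, the flat buffer would look like 32 undifferentiated elements and the per-row ownership pattern could not be reconstructed on a later reshape.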

### Concrete Operational Impact

The combination of `--split-mode tensor` and a quantized KV cache (Q4_0, Q8_0, etc.) is particularly relevant for heterogeneous or memory-constrained multi-GPU setups. `--sm tensor` splits individual tensors across GPUs rather than assigning whole layers to each GPU, enabling more granular use of GPUs with different VRAM capacities. Relative to an f16 cache, KV cache quantization reduces the memory footprint by roughly 47% (Q8_0) to 72% (Q4_0), which is critical for fitting long contexts (128K+) on consumer hardware.

Before b9455, a user running Llama 3.1 70B at 32K context on two 16 GB GPUs with quantized KV cache had to choose: either `--sm tensor` for fine-grained distribution, or KV quantization — never both. That trade-off is now eliminated.
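The memory savings can be checked with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and ggml's block formats, where Q8_0 and Q4_0 store 32 elements in 34 and 18 bytes respectively (8.5 and 4.5 bits per element). The function name `kv_cache_bytes` is made up for this example, and the result ignores any allocator padding or compute buffers.

```python
# Bits per element, including the per-block scale stored by ggml's block formats.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate size of the K and V caches for a full context window."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return elements * BITS_PER_ELEMENT[cache_type] / 8


# Assumed model shape: Llama 3.1 70B at a 32K context.
baseline = kv_cache_bytes(80, 8, 128, 32768, "f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(80, 8, 128, 32768, cache_type)
    saving = 100 * (1 - size / baseline)
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({saving:.0f}% smaller than f16)")
# f16: 10.00 GiB (0% smaller than f16)
# q8_0: 5.31 GiB (47% smaller than f16)
# q4_0: 2.81 GiB (72% smaller than f16)
```

On two 16 GB GPUs, the difference between a 10 GiB and a 2.8 GiB cache is a large share of the VRAM left over after the weights, which is why being able to combine quantization with tensor splitting matters.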

### Who Loses Here

Datacenter-oriented multi-GPU serving stacks (vLLM, TensorRT-LLM, SGLang) see llama.cpp close a meaningful functional gap on heterogeneous hardware configurations. llama.cpp remains less competitive on raw throughput for homogeneous A100/H100 clusters, but its advantage on consumer and semi-professional hardware (RTX 3090/4090, mixed configurations) strengthens further. Users who built manual workarounds (custom sharding scripts, tensor padding) will need to revalidate their pipelines.

### Project Velocity Context

The "Them boys can cook, one big fix after another" comment reflects a measurable reality: llama.cpp has maintained a near-daily release cadence for months, with build numbers now exceeding 9400. This specific fix required deep understanding of the interaction between three distinct layers — the llama.cpp compute graph, the ggml abstraction, and specific GPU backends — which explains why multiple PR attempts were needed before arriving at an architecturally clean solution.

**Immediate action for practitioners**: if you run multi-GPU setups with `--sm tensor`, update to b9455+ and enable `--cache-type-k q8_0` or `--cache-type-k q4_0` depending on your quality/memory tolerance.
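As a rough sketch of what that invocation might look like, the snippet below assembles a `llama-server` command line in Python. The `--split-mode tensor` and `--cache-type-k` flags are taken from the text above; the model path, context size, and `-ngl 99` (offload all layers) are placeholder assumptions, and `build_server_command` is a hypothetical helper, not part of llama.cpp.

```python
import shlex


def build_server_command(model_path, n_ctx, cache_type_k="q8_0"):
    """Assemble an argv list for llama-server; all values here are illustrative."""
    return [
        "llama-server",
        "-m", model_path,                # placeholder path to a GGUF model
        "-c", str(n_ctx),                # context size in tokens
        "-ngl", "99",                    # assumption: offload all layers to GPU
        "--split-mode", "tensor",        # flag and value as named in the text
        "--cache-type-k", cache_type_k,  # q8_0 or q4_0, per the text
    ]


argv = build_server_command("models/model.gguf", 32768)
print(shlex.join(argv))
```

Building the command as a list keeps each flag and value separate, so it can be passed directly to `subprocess.run` without shell quoting issues. Check `llama-server --help` on your build to confirm the flags it accepts.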


Summary generated by Claude — human-verified