Reddit r/LocalLLaMA

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

Signal: 82
Hype: 15
In three lines
Flash Attention optimization for llama.cpp on RDNA3 GPUs, using 47% less KV VRAM than Vulkan with f16 K. It packs four 8-bit K-values into native sudot4 instructions without lossy quantization. At 128k context with an MTP draft: 21.76 GiB vs 23.18 GiB (1.42 GiB saved). Quality is preserved: mean KLD 0.00455 (q4_0 V), with 97.06% identical top tokens.
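
To make the packing idea concrete, here is a minimal Python sketch of a 4-way 8-bit dot product: four signed 8-bit values share one 32-bit word, and one operation multiplies the four lanes pairwise and adds the products to an accumulator. It only emulates the arithmetic on the CPU and is not the post's kernel. The function names (`pack4_i8`, `unpack4_i8`, `dot4_i8`) are hypothetical, the byte order is an assumption, and how the K values become 8-bit in the first place is not covered by the summary.

```python
import struct


def pack4_i8(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian, assumed)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # "<4b" = four signed bytes, "<I" = one unsigned 32-bit integer
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack4_i8(word):
    """Split a 32-bit word back into its four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dot4_i8(a_word, b_word, acc=0):
    """Emulate a dot4-style operation: acc + sum of four lane-wise products."""
    a = unpack4_i8(a_word)
    b = unpack4_i8(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


# Hypothetical 8-bit K and Q lanes, chosen only for illustration.
k_word = pack4_i8([1, -2, 3, -4])
q_word = pack4_i8([5, 6, -7, 8])
partial_score = dot4_i8(k_word, q_word)  # 5 - 12 - 21 - 32 = -60
```

On hardware, the benefit is that one instruction consumes a whole packed word per operand, so four multiply-accumulates are issued at once and the K data occupies a quarter of the registers that unpacked 32-bit lanes would need.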
Llama · Code generation · Benchmarks · Infrastructure

Summary generated by Claude — human-verified