Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
Signal: 65 · Hype: 25
In three lines
A developer proposes two optimizations for llama.cpp: dynamic KV cache quantization and on-demand mmproj loading. The proof-of-concept implementation adds an HTTP endpoint, /requantize_kvcache, that switches configuration (quantized or f16 KV cache, mmproj on or off) without a full model reload. It was tested on an RTX 5090 with Qwen3.5-27B Q6_K.
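To give a rough sense of why switching the KV cache between f16 and a quantized type matters on a single GPU, here is a minimal sketch that estimates cache size per type. The per-element sizes follow ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The function name `kv_cache_bytes` and the model dimensions are hypothetical, chosen only for illustration; they are not the real Qwen3.5-27B configuration, and the sketch assumes every layer keeps an ordinary K and V cache.

```python
# Bytes per cached element. f16 is 2 bytes; q8_0 packs 32 values into
# 34 bytes (32 int8 + one f16 scale); q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate combined K+V cache size, assuming a plain attention stack."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical dimensions for illustration only (an assumption, not taken
# from the post or from any real model card).
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(cache_type=cache_type, **dims) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

Under these assumed dimensions the quantized cache is roughly half (q8_0) or a bit over a quarter (q4_0) of the f16 size, which is the kind of headroom that could be traded against loading mmproj on demand.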
Summary generated by Claude — human-verified