ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
Signal
82
Hype
18
In three linesProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains speedup at contexts up to 170k tokens.Read source
Your take?
Summary generated by Claude — human-verified