arXiv cs.LG

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

Signal: 82
Hype: 18
In three lines: ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, it recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains the speedup at contexts up to 170k tokens.
Tags: Llama, Qwen, Reasoning, Infrastructure

Summary generated by Claude — human-verified