arXiv cs.AI

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

Signal: 78
Hype: 15
In three lines: ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight Small-Model Proxy asynchronously scores KV cache importance for the target model. Tested on Llama-3.1, Qwen-2.5, and Qwen-3, it recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains gains up to 170k tokens.
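
To make the proxy-scoring idea concrete, here is a minimal sketch of score-based KV pruning. It is not ProxyKV's actual algorithm: it assumes the proxy emits one importance score per token position and that pruning keeps a fixed fraction of the highest-scoring positions. The names `select_kv_positions`, `prune_kv`, and `keep_ratio` are hypothetical, and the asynchronous scoring step is omitted.

```python
import math
from typing import List, Sequence


def select_kv_positions(proxy_scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Pick token positions to retain, given per-token scores from a proxy model.

    Assumption (not from the paper): one scalar score per position, and a
    fixed keep ratio rather than an adaptive budget.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(proxy_scores)
    k = math.ceil(n * keep_ratio)
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(n), key=lambda i: (-proxy_scores[i], i))
    # Return kept positions in their original order so the cache stays sequential.
    return sorted(ranked[:k])


def prune_kv(kv_entries: Sequence, keep: Sequence[int]) -> list:
    """Keep only the cache entries at the selected positions."""
    return [kv_entries[i] for i in keep]


# Hypothetical scores for a six-token context; the values are illustrative only.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
kept = select_kv_positions(scores, keep_ratio=0.5)
pruned = prune_kv(["k0", "k1", "k2", "k3", "k4", "k5"], kept)
```

In this toy setup the target model would attend over only the retained entries. How ProxyKV maps proxy scores onto the target model's layers and heads is not described in the summary and is left out here.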
Llama · Qwen · Reasoning · Benchmarks · Infrastructure

Summary generated by Claude — human-verified