arXiv cs.AI · 19 May 2026

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference


In three lines: ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight Small-Model Proxy asynchronously scores KV cache importance for the target model. Tested on Llama-3.1, Qwen-2.5, and Qwen-3, it recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains gains at context lengths up to 170k tokens.
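To make the idea of score-based KV cache pruning concrete, here is a minimal sketch in plain Python. It assumes the proxy has already produced one importance score per cached token position. How ProxyKV computes those scores, and how it maps them from the small model to the target model, is not described in this summary, so the function names `select_kv_positions` and `prune_kv_cache` and the `keep_ratio` parameter are hypothetical, and the top-k selection shown is a generic technique rather than the paper's method.

```python
from typing import List, Sequence, Tuple


def select_kv_positions(proxy_scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Pick the highest-scoring token positions, returned in original order.

    proxy_scores: one importance score per cached token (assumed to come
                  from a small proxy model; the scoring itself is not shown).
    keep_ratio:   fraction of positions to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(proxy_scores)
    if n == 0:
        return []
    # Always keep at least one position so the cache is never empty.
    budget = max(1, int(n * keep_ratio))
    # Rank by score, highest first; ties go to the earlier position.
    ranked = sorted(range(n), key=lambda i: (-proxy_scores[i], i))
    # Restore positional order so the pruned cache stays sequential.
    return sorted(ranked[:budget])


def prune_kv_cache(keys: Sequence, values: Sequence, keep: Sequence[int]) -> Tuple[list, list]:
    """Gather the retained key/value entries for the target model's cache."""
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]  # hypothetical proxy scores
    keep = select_kv_positions(scores, keep_ratio=0.5)
    pruned_keys, pruned_values = prune_kv_cache(list("abcdef"), list("ABCDEF"), keep)
    print(keep, pruned_keys, pruned_values)
```

In a real system the keys and values would be per-layer, per-head tensors rather than lists, and the selection would typically run on the accelerator; the lists here only illustrate the gather step.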


Llama Qwen Reasoning Benchmarks Infrastructure

Summary generated by Claude — human-verified
