arXiv cs.LG·19 May 2026

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

In three lines: ProxyKV introduces a cross-model proxy pruning framework to accelerate long-context LLM inference. A lightweight in-family small model evaluates KV cache importance asynchronously via HybridAxialMapper and Multi-Granularity Hybrid Loss. On Llama-3.1, Qwen-2.5, and Qwen-3, it recovers 98.7% of KVZip accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) and sustains the speedup at contexts up to 170k tokens.
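The general idea of score-based KV cache pruning can be illustrated with a toy sketch. It is not the paper's method: it assumes the proxy model has already produced one importance score per token position, and it applies a plain top-k rule with a hypothetical `keep_ratio` parameter. The HybridAxialMapper, the Multi-Granularity Hybrid Loss, and the asynchronous scheduling are not modelled, and the function names are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def select_kv_positions(proxy_scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the token positions to keep, in their original order.

    proxy_scores: one importance score per position (assumed to come
                  from the small proxy model; higher means more important).
    keep_ratio:   hypothetical fraction of positions to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(proxy_scores)
    if n == 0:
        return []
    k = max(1, math.ceil(n * keep_ratio))
    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(n), key=lambda i: (-proxy_scores[i], i))
    # Restore positional order so the pruned cache stays sequence-aligned.
    return sorted(ranked[:k])


def prune_kv(keys: Sequence, values: Sequence, keep: Sequence[int]) -> Tuple[list, list]:
    """Gather the retained key/value entries of the large model's cache."""
    return [keys[i] for i in keep], [values[i] for i in keep]


scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3]
keep = select_kv_positions(scores, keep_ratio=0.5)
pruned_keys, pruned_values = prune_kv(list("abcdef"), list("ABCDEF"), keep)
print(keep)         # [0, 2, 3]
print(pruned_keys)  # ['a', 'c', 'd']
```

In a real system the scores would likely be computed per layer and per head, and the gather would run on tensors rather than Python lists; the sketch only shows the selection step.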

Summary generated by Claude — human-verified
