arXiv cs.AI

RAP: Runtime Adaptive Pruning for LLM Inference

Signal: 78
Hype: 15
In three lines
RAP is an elastic pruning framework for LLM inference that uses reinforcement learning to adapt its compression strategy at runtime, responding to fluctuations in available memory and to heterogeneous KV-cache demands. The RL agent optimizes the parameter-to-KV-cache ratio in real time, retaining only the components that maximize utility within the current memory budget.
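
To make the budgeting idea concrete, here is a minimal sketch of the selection step only. It is not RAP's method: the summary describes an RL agent, while this stand-in uses a simple greedy rule (highest utility per megabyte first). The `Component` type, the `utility` scores, and the `select_components` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A prunable unit of the model (hypothetical, e.g. an attention or FFN block)."""
    name: str
    size_mb: float   # memory taken by the component's weights
    utility: float   # assumed importance score; how RAP scores utility is not stated here


def select_components(components, budget_mb, kv_cache_mb):
    """Greedy stand-in for the RL policy.

    Reserve memory for the KV cache first, then keep the components with the
    best utility per MB that still fit in what is left of the budget.
    """
    remaining = budget_mb - kv_cache_mb
    if remaining <= 0:
        return []  # the KV cache alone exhausts the budget

    kept = []
    ranked = sorted(components, key=lambda c: c.utility / c.size_mb, reverse=True)
    for comp in ranked:
        if comp.size_mb <= remaining:
            kept.append(comp.name)
            remaining -= comp.size_mb
    return kept
```

Under these assumptions, a larger KV-cache demand leaves less room for parameters, so the same budget retains fewer components. That is the trade-off the summary attributes to the parameter-to-KV-cache ratio.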
Reinforcement learning · Infrastructure · Benchmarks

Summary generated by Claude — human-verified