arXiv cs.AI

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Signal: 82
Hype: 15
In three lines
OSCAR quantizes the KV cache to INT2 for long-context LLMs, using attention-aware covariance structure that is estimated offline. Tested on Qwen3 (4B–32B) and GLM-4.7 (358B), it narrows the accuracy gap to 1.42–3.78 points relative to BF16, cuts memory by 8x, and improves throughput by 7x. Its custom INT2 kernel is compatible with vLLM and SGLang.
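The 8x memory figure is consistent with simple bit arithmetic: BF16 stores 16 bits per value and INT2 stores 2. The sketch below illustrates that arithmetic with a generic group-wise 2-bit quantizer that packs four codes into each byte. It is not OSCAR's kernel and it omits the rotation step entirely; the function names and the group size of 32 are hypothetical, and the per-group scale and offset are left out of the byte count for simplicity.

```python
def quantize_int2(values, group_size=32):
    """Generic asymmetric 2-bit quantization per group (illustrative only)."""
    codes, params = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 3 or 1.0  # 3 = 2**2 - 1 steps; avoid a zero scale
        params.append((scale, lo))
        codes.extend(min(3, max(0, round((v - lo) / scale))) for v in group)
    return codes, params


def pack_int2(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_int2(packed, count):
    """Inverse of pack_int2; `count` drops any padding codes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes[:count]


def dequantize_int2(codes, params, group_size=32):
    """Map each code back to offset + code * scale for its group."""
    return [params[i // group_size][1] + c * params[i // group_size][0]
            for i, c in enumerate(codes)]


values = [0.01 * i for i in range(64)]   # stand-in for cached K/V entries
codes, params = quantize_int2(values)
packed = pack_int2(codes)
restored = dequantize_int2(unpack_int2(packed, len(values)), params)

bf16_bytes = 2 * len(values)             # BF16 uses 2 bytes per value
print(bf16_bytes / len(packed))          # 8.0
```

In a real cache the scale and offset metadata would reduce the ratio slightly, depending on group size and how those parameters are stored.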
Reasoning, Benchmarks, Infrastructure

Summary generated by Claude — human-verified