arXiv cs.AI · 19 May 2026

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

In three lines: OSCAR quantizes the KV cache of long-context LLMs to INT2, guided by attention-aware covariance structures estimated offline. Tested on Qwen3 (4B–32B) and GLM-4.7 (358B), it narrows the accuracy gap to BF16 to 1.42–3.78 points, cuts KV-cache memory by 8x and improves throughput by 7x. It includes a custom INT2 kernel compatible with vLLM and SGLang.
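The general idea named in the title can be illustrated with a small sketch. It estimates a key covariance matrix offline from calibration activations, rotates keys into that matrix's eigenbasis, and stores the rotated keys as 2-bit codes. This is a minimal sketch under stated assumptions and is not OSCAR's algorithm. It uses plain PCA of the keys rather than the paper's attention-aware covariance. It also relies on several choices of our own: per-channel asymmetric INT2 quantization, NumPy, random data, a hypothetical model config, and hypothetical function names (`estimate_rotation`, `quantize_int2`, `dequantize_int2`, `kv_payload_bytes`). The rotation is orthogonal, so rotating the query by the same matrix leaves attention logits unchanged before quantization.

```python
import numpy as np

# Minimal sketch, NOT the OSCAR algorithm: plain PCA rotation of keys
# (no attention-aware weighting) followed by per-channel asymmetric INT2.
# All function names and the model config below are hypothetical.


def estimate_rotation(calib_keys: np.ndarray) -> np.ndarray:
    """Offline: orthogonal eigenbasis of the key covariance.

    calib_keys: (num_tokens, head_dim) calibration activations.
    Returns R with eigenvectors as columns, sorted by descending variance.
    """
    centered = calib_keys - calib_keys.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(calib_keys) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]


def quantize_int2(x: np.ndarray):
    """Asymmetric 2-bit quantization per channel (column), across tokens."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / 3.0  # 2 bits -> 4 levels: 0, 1, 2, 3
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize_int2(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Map 2-bit codes back to floats using the stored scale and offset."""
    return codes.astype(np.float64) * scale + lo


def kv_payload_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits):
    """K and V payload size in bytes, ignoring scale/offset metadata."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bits / 8


rng = np.random.default_rng(0)
head_dim = 8
mixing = rng.normal(size=(head_dim, head_dim))  # induces correlated channels

# Offline: calibration keys -> rotation matrix.
R = estimate_rotation(rng.normal(size=(512, head_dim)) @ mixing)

# Online: rotate new keys, then store them as 2-bit codes.
keys = rng.normal(size=(64, head_dim)) @ mixing
keys_rot = keys @ R
codes, scale, lo = quantize_int2(keys_rot)

# Attention logits: rotate the query by the same R (R @ R.T == I).
query = rng.normal(size=head_dim)
exact_logits = keys @ query
approx_logits = dequantize_int2(codes, scale, lo) @ (R.T @ query)

# Hypothetical config, only to show the 16-bit vs 2-bit payload ratio.
cfg = dict(num_tokens=32768, num_layers=36, num_kv_heads=8, head_dim=128)
ratio = kv_payload_bytes(**cfg, bits=16) / kv_payload_bytes(**cfg, bits=2)
print(f"payload reduction: {ratio}x")
print("max logit error:", np.abs(exact_logits - approx_logits).max())
```

The 8x ratio in this sketch counts only the 2-bit payload against 16-bit storage. The per-channel scales and offsets add some overhead. A production kernel would also typically fuse dequantization into the attention computation rather than materializing float keys.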

Tags: Reasoning · Benchmarks · Infrastructure

Summary generated by Claude — human-verified