Shard - getting to 10× KV cache compression
Signal
75
Hype
25
In three linesShard is a HuggingFace Cache achieving 10× KV memory compression for Llama-3.1-8B at 8K context (11× at 32K) with no measurable impact on NIAH/LongBench. Uses PCA + int4 quantization on K and Hadamard rotation + vector quantization on V. Attention runs directly on compressed K.Read source
Your take?
Summary generated by Claude — human-verified