Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!
Signal
72
Hype
35
In three linesLocal inference of DeepSeek-V4-Flash (284B, 13B active) on 4x RTX 2080 Ti (~$2.5k). Custom Turing CUDA kernels + W8A8 quantization + heterogeneous offloading achieve 255 tok/s prefill. Open-sourced on GitHub.Read source
Your take?
Summary generated by Claude — human-verified