Reddit r/LocalLLaMA

Experts first llama.cpp

Signal: 65 · Hype: 25
In three lines: An experimental llama.cpp fork optimizes MoE (mixture-of-experts) inference for 12GB VRAM GPUs. Instead of offloading full layers, the author selectively loads individual experts into VRAM, reaching 26 tk/s on an RTX 2060 (vs. 19 tk/s by default) with a 62% hit rate. The author is seeking testers with RTX 3060/4060 cards.
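The summary does not say how the fork decides which experts stay in VRAM, so the following is only a rough sketch of what a "hit rate" could mean here, assuming a simple frequency-based policy. The function names, the routing trace, and the budget are all hypothetical and are not taken from the fork.

```python
from collections import Counter

def pick_resident_experts(activations, budget):
    """Return the `budget` most frequently routed (layer, expert) pairs.

    Assumption: residency is chosen by raw activation frequency; the
    fork's real policy is not described in the summary.
    """
    counts = Counter(activations)
    return {key for key, _ in counts.most_common(budget)}

def hit_rate(activations, resident):
    """Fraction of expert activations served from the resident (VRAM) set."""
    if not activations:
        return 0.0
    hits = sum(1 for key in activations if key in resident)
    return hits / len(activations)

# Toy routing trace of (layer, expert) pairs, invented for illustration.
trace = [
    (0, 1), (0, 1), (0, 3), (1, 2), (0, 1),
    (1, 2), (1, 5), (0, 3), (0, 1), (1, 7),
]

resident = pick_resident_experts(trace, budget=2)
rate = hit_rate(trace, resident)

# Throughput ratio implied by the reported figures (26 tk/s vs. 19 tk/s).
speedup = 26 / 19

print(f"resident experts: {sorted(resident)}")
print(f"hit rate: {rate:.0%}")
print(f"reported speedup: {speedup:.2f}x")
```

Under this reading, a miss means the expert's weights must be served from system RAM, which is why a higher hit rate would be expected to translate into higher tokens per second.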
Llama · Open source · Infrastructure · Code generation

Summary generated by Claude — human-verified