Reddit r/LocalLLaMA

llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

Signal: 65
Hype: 15
In three lines
A PR on llama.cpp limits how much memory `llama_context` allocates for logits.
With `-ub 2048` and MTP, the change saves 1.2 GB of VRAM.
It proposes an API that reserves logits space only for the needed `n_seqs`; the default still covers all tokens, but it can be set to 1 in server-context.
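
To give a feel for why the logits buffer matters, here is a rough back-of-the-envelope sketch. It assumes a dense buffer of 32-bit floats with one row of vocabulary-sized logits per output token. The vocabulary size and the helper name `logits_buffer_bytes` are hypothetical and chosen only for illustration; they are not taken from the PR, and the actual saving depends on the model and on how MTP allocates its buffers.

```python
def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Bytes needed for a dense logits buffer: one row of n_vocab values per output."""
    return n_outputs * n_vocab * bytes_per_logit


N_VOCAB = 150_000  # hypothetical vocabulary size, not from the PR
N_UBATCH = 2048    # matches the -ub 2048 setting mentioned in the summary
N_SEQS = 1         # outputs actually needed, e.g. one per sequence (assumption)

# Reserving logits for every token in the micro-batch.
full = logits_buffer_bytes(N_UBATCH, N_VOCAB)

# Reserving logits only for the outputs that are needed.
limited = logits_buffer_bytes(N_SEQS, N_VOCAB)

saved = full - limited
print(f"full: {full / 1e9:.2f} GB, limited: {limited / 1e6:.2f} MB, saved: {saved / 1e9:.2f} GB")
```

Under these assumptions the buffer grows linearly with the number of reserved outputs, so capping it at the number of sequences rather than the micro-batch size removes almost all of it.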
Llama · Open source · Infrastructure

Summary generated by Claude — human-verified