llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp
Signal: 65
Hype: 15
In three lines: A PR on llama.cpp that limits the memory allocated for logits in `llama_context`. With `-ub 2048` and MTP, it saves 1.2 GB of VRAM. It proposes an API that reserves logits space only for the `n_seqs` actually needed; the default still covers all tokens, but the value can be set to 1 in server-context.
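To see why capping the number of outputs matters, here is a rough back-of-the-envelope sketch, not the PR's actual code. It assumes the logits buffer holds one 4-byte float per vocabulary entry per output position, and it uses a hypothetical vocabulary size of 150,000 (the summary does not name the model). Only the micro-batch size of 2048 comes from the summary; the effect of MTP is not modeled.

```python
def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Bytes needed to store n_vocab logits for each of n_outputs positions."""
    return n_outputs * n_vocab * bytes_per_logit


# Assumption: illustrative vocabulary size, not taken from the PR.
N_VOCAB = 150_000
# From the summary: micro-batch size set with -ub 2048.
N_UBATCH = 2048

# Reserving a logits row for every token in the micro-batch.
all_tokens = logits_buffer_bytes(N_UBATCH, N_VOCAB)
# Reserving a logits row for a single output position.
one_output = logits_buffer_bytes(1, N_VOCAB)
saved = all_tokens - one_output

print(f"all tokens: {all_tokens / 1e9:.2f} GB")
print(f"one output: {one_output / 1e6:.2f} MB")
print(f"saved:      {saved / 1e9:.2f} GB")
```

Under these assumed numbers the difference is on the order of the 1.2 GB quoted above, which suggests the saving comes mostly from no longer reserving a full-vocabulary row for every token in the micro-batch.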
Summary generated by Claude — human-verified