Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?
Signal
45
Hype
25
In three linesUser reports Gemma 4 12B (unified audio/vision/text model) ignores audio input when system prompt exceeds ~21k tokens. Model works well with minimal prompt but generates generic/hallucinated responses with dense context. Behavior reproduced across vLLM, llama.cpp, and LiteRT-LM. Appears to be an inherent attention saturation limit.Read source
Your take?
Summary generated by Claude — human-verified