Back to feed
Reddit r/LocalLLaMA·

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Signal
45
Hype
25
In three linesUser reports Gemma 4 12B (unified audio/vision/text model) ignores audio input when system prompt exceeds ~21k tokens. Model works well with minimal prompt but generates generic/hallucinated responses with dense context. Behavior reproduced across vLLM, llama.cpp, and LiteRT-LM. Appears to be an inherent attention saturation limit.
Read source
Your take?
GeminiVoiceMulti-agentPrompt engineering

Summary generated by Claude — human-verified