Reddit r/LocalLLaMA · 10 June 2026

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

In three lines: A user reports that Gemma 4 12B (a unified audio/vision/text model) ignores audio input once the system prompt exceeds roughly 21k tokens. The model works well with a minimal prompt but produces generic or hallucinated responses when given dense context. The behavior was reproduced across vLLM, llama.cpp, and LiteRT-LM, so it appears to be an inherent attention saturation limit in the model rather than a problem with any single runtime.

Summary generated by Claude — human-verified
