High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]
Signal: 35
Hype: 15
In three lines
User reports high E2E latency (3-5s) on a fine-tuned Gemma 4 26B despite low TTFT (100-300ms) on an H100 with vLLM and FP8 quantization. Exploring optimizations: speculative decoding (EAGLE/Medusa), draft models, or bottleneck investigation.
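A rough sketch of the arithmetic behind the report: subtracting TTFT from E2E latency leaves the time spent generating tokens after the first one, which is the phase speculative decoding targets. The latency ranges come from the post. The function name `decode_budget` and the 256-token output length are hypothetical, chosen only for illustration.

```python
def decode_budget(e2e_s, ttft_s, output_tokens):
    """Split end-to-end latency into time-to-first-token and decode time."""
    decode_s = e2e_s - ttft_s
    return {
        "decode_s": decode_s,
        "decode_share": decode_s / e2e_s,
        "ms_per_token": 1000.0 * decode_s / output_tokens,
    }

# Latency ranges as reported in the post; output_tokens=256 is an assumed value.
best = decode_budget(e2e_s=3.0, ttft_s=0.3, output_tokens=256)
worst = decode_budget(e2e_s=5.0, ttft_s=0.1, output_tokens=256)

print(f"decode share: {best['decode_share']:.0%} to {worst['decode_share']:.0%}")
print(f"ms per token: {best['ms_per_token']:.1f} to {worst['ms_per_token']:.1f}")
```

Under these numbers the decode phase accounts for roughly 90% or more of the total, so measuring the real output length and per-token time is a reasonable first step before choosing an optimization.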
Summary generated by Claude — human-verified