Reddit r/MachineLearning · 21 May 2026

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]


In three lines

User reports high end-to-end (E2E) latency (3-5 s) on a fine-tuned Gemma 4 26B despite low time to first token (TTFT, 100-300 ms), served on an H100 with vLLM and FP8 quantization. Optimizations under consideration: speculative decoding (EAGLE/Medusa), draft models, or further bottleneck investigation.
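The gap between the two figures can be made concrete with a little arithmetic. The sketch below splits E2E latency into the TTFT portion and the decode portion using the common approximation E2E ≈ TTFT + (output tokens − 1) × time per output token. The function name `decode_breakdown` and the output length of 200 tokens are hypothetical; the post summary gives no output length, so the per-token figure is illustrative only.

```python
def decode_breakdown(e2e_s: float, ttft_s: float, output_tokens: int) -> dict:
    """Split end-to-end latency into TTFT and decode portions.

    Assumes E2E ~= TTFT + (output_tokens - 1) * time_per_output_token,
    i.e. the first token arrives at TTFT and the rest are decoded one by one.
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate per-token time")
    decode_s = e2e_s - ttft_s                 # time spent after the first token
    tpot_s = decode_s / (output_tokens - 1)   # average time per output token
    return {
        "decode_s": decode_s,
        "decode_share": decode_s / e2e_s,     # fraction of E2E spent decoding
        "tpot_ms": tpot_s * 1000.0,
        "tokens_per_s": 1.0 / tpot_s,
    }


# Midpoints of the reported ranges (E2E 3-5 s, TTFT 100-300 ms).
# output_tokens=200 is an assumed value, not taken from the post.
result = decode_breakdown(e2e_s=4.0, ttft_s=0.2, output_tokens=200)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

At these midpoints, decoding accounts for roughly 95% of the total latency, which is consistent with the post's focus on decode-side techniques such as speculative decoding rather than on prefill.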


Gemini Fine-tuning Infrastructure Tools

Summary generated by Claude — human-verified

