Reddit r/MachineLearning

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]

Signal: 35
Hype: 15
In three lines: A user reports high end-to-end (E2E) latency of 3-5 s on a fine-tuned Gemma 4 26B despite a low time to first token (TTFT) of 100-300 ms, served on an H100 with vLLM and FP8 quantization. Optimizations under consideration: speculative decoding (EAGLE/Medusa), draft models, or further investigation of the bottleneck.
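A rough way to see where the time goes: E2E latency is approximately TTFT plus the time spent decoding the remaining output tokens, so a low TTFT with a multi-second E2E points at the decode phase. The sketch below only does that arithmetic. The output length of 191 tokens is a hypothetical value chosen for illustration (the post does not state it), and the function name is made up here.

```python
def decode_budget(e2e_s: float, ttft_s: float, output_tokens: int) -> dict:
    """Split end-to-end latency into TTFT and decode time.

    Assumes E2E ~= TTFT + (output_tokens - 1) * time-per-output-token,
    i.e. the first token arrives at TTFT and the rest are decoded one by one.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to estimate decode time")
    if ttft_s > e2e_s:
        raise ValueError("TTFT cannot exceed end-to-end latency")

    decode_s = e2e_s - ttft_s                        # time spent after the first token
    tpot_ms = decode_s / (output_tokens - 1) * 1000  # average ms per decoded token
    return {
        "decode_s": decode_s,
        "decode_share": decode_s / e2e_s,            # fraction of E2E spent decoding
        "tpot_ms": tpot_ms,
        "tokens_per_s": 1000.0 / tpot_ms,
    }


if __name__ == "__main__":
    # Mid-range figures from the summary; 191 output tokens is an assumption.
    for key, value in decode_budget(e2e_s=4.0, ttft_s=0.2, output_tokens=191).items():
        print(f"{key}: {value:.3f}")
```

If the decode share comes out near 1, as it does with these assumed numbers, per-token decode speed is the place to look, which is the part speculative decoding targets.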
Tags: Gemini, Fine-tuning, Infrastructure, Tools

Summary generated by Claude — human-verified