Reddit r/MachineLearning · 18 May 2026

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]


In three lines: A custom CUDA runtime for small-batch inference (robotics, VLA, world models). The bottleneck is not GEMM alone but also runtime overhead: kernel fragmentation, layout transitions, precision conversions (FP8/FP4), and Python scheduling. Reported results: Pi0.5 on RTX 5090 ~17.6 ms, GROOT N1.6 ~12.5-13.1 ms, Qwen 27B ~129 tok/s.
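To put the reported figures on a common footing, the short Python sketch below converts per-inference latency into inferences per second and token throughput into milliseconds per token. The numbers are the ones quoted in the summary; the helper names are hypothetical, and the conversion assumes back-to-back inference with nothing else in the loop, which the post may or may not measure.

```python
# Figures as quoted in the summary; everything else is plain unit conversion.
# Assumption: inferences run back to back, so rate = 1 / latency.

def latency_ms_to_hz(latency_ms: float) -> float:
    """Inferences per second for a given per-inference latency in ms."""
    return 1000.0 / latency_ms


def rate_to_ms(per_second: float) -> float:
    """Milliseconds per item for a given items-per-second rate."""
    return 1000.0 / per_second


# (fastest, slowest) latency in milliseconds, as reported.
reported_latency_ms = {
    "Pi0.5": (17.6, 17.6),
    "GROOT N1.6": (12.5, 13.1),
}
reported_tokens_per_s = {"Qwen 27B": 129.0}

# The slowest latency gives the lower bound on rate, the fastest the upper.
inference_rate_hz = {
    name: (latency_ms_to_hz(slowest), latency_ms_to_hz(fastest))
    for name, (fastest, slowest) in reported_latency_ms.items()
}
ms_per_token = {
    name: rate_to_ms(rate) for name, rate in reported_tokens_per_s.items()
}

for name, (low, high) in inference_rate_hz.items():
    print(f"{name}: {low:.1f} to {high:.1f} inferences/s")
for name, ms in ms_per_token.items():
    print(f"{name}: {ms:.2f} ms per token")
```

At budgets of a few milliseconds per step, fixed per-kernel and per-call costs take up a larger share of the total than they would in large-batch serving, which is consistent with the post's emphasis on runtime overhead rather than GEMM alone.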


Code generation Infrastructure Robotics Benchmarks

Summary generated by Claude — human-verified
