Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]
Signal: 78
Hype: 25
In three lines
A custom CUDA runtime for small-batch inference (robotics, VLA, world models). The bottleneck is not GEMM alone but runtime overhead: kernel fragmentation, layout transitions, precision conversions (FP8/FP4), and Python scheduling. Reported results: Pi0.5 on RTX 5090 ~17.6 ms, GROOT N1.6 ~12.5-13.1 ms, Qwen 27B ~129 tok/s.
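The reported figures mix two units (milliseconds per inference and tokens per second), so a small conversion can put them on a common footing. The sketch below is only illustrative arithmetic on the numbers quoted in the summary. The helper names `ms_to_hz` and `tok_per_s_to_ms` are hypothetical, and it assumes inferences run back to back with no pipelining or batching, which the summary does not state.

```python
def ms_to_hz(latency_ms: float) -> float:
    """Inferences per second if each run starts when the previous one ends."""
    return 1000.0 / latency_ms


def tok_per_s_to_ms(tokens_per_second: float) -> float:
    """Average milliseconds spent per generated token."""
    return 1000.0 / tokens_per_second


# Latency figures as quoted in the summary (low, high), in milliseconds.
reported_latency_ms = {
    "Pi0.5": (17.6, 17.6),
    "GROOT N1.6": (12.5, 13.1),
}

for model, (low_ms, high_ms) in reported_latency_ms.items():
    # Lower latency gives the higher achievable rate, so the bounds swap.
    fastest_hz = ms_to_hz(low_ms)
    slowest_hz = ms_to_hz(high_ms)
    print(f"{model}: {slowest_hz:.1f} to {fastest_hz:.1f} inferences/s")

# Throughput figure as quoted in the summary, in tokens per second.
qwen_tok_per_s = 129.0
print(f"Qwen 27B: {tok_per_s_to_ms(qwen_tok_per_s):.2f} ms per token")
```

Under that assumption, 17.6 ms corresponds to roughly 57 inferences per second, 12.5-13.1 ms to roughly 76-80 per second, and 129 tok/s to about 7.75 ms per token.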
Summary generated by Claude — human-verified