Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]
In three lines
A monokernel optimized for LLM inference on AMD MI300X reaches up to 3,300 output tokens/s per request (batch 1, no speculative decoding).
The architecture is mapped to the GPU's physical topology.
Initial support covers a 2B model; frontier MoE models are planned.
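To put the headline figure in perspective, a minimal sketch of the arithmetic: at batch 1 without speculative decoding, each decode step yields one token, so the quoted throughput implies a fixed time budget per step. The function name `per_token_latency_us` is hypothetical and used here only for illustration; the only input taken from the post is the 3,300 tokens/s figure.

```python
def per_token_latency_us(tokens_per_second: float) -> float:
    """Return the time budget per output token, in microseconds.

    Assumes one token per decode step (batch 1, no speculative decoding),
    so per-token latency is simply the inverse of the throughput.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1_000_000 / tokens_per_second


# Headline figure from the post: up to 3,300 output tokens/s per request.
budget_us = per_token_latency_us(3300)
print(f"{budget_us:.0f} us per token")  # prints "303 us per token"
```

In other words, the whole forward pass for one token would have to finish in roughly 0.3 ms to sustain that rate.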
Summary generated by Claude — human-verified