Reddit r/MachineLearning

Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

In three lines: Optimized monokernel for LLM inference on AMD MI300X, reaching up to 3,300 output tokens/s per request (batch size 1, no speculative decoding). The architecture is mapped to the GPU's physical topology. Initial support covers a 2B model; frontier MoE support is planned.
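
To put the headline figure in per-token terms, here is a minimal sketch of the arithmetic. It assumes the 3,300 tokens/s figure is a sustained single-request decode rate; the function names are hypothetical and not from the project.

```python
def per_token_latency_us(tokens_per_second: float) -> float:
    """Average time per output token, in microseconds, for a single request."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1_000_000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to emit num_tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Assumption: the peak figure quoted in the post title, treated as constant.
PEAK_TOKENS_PER_SECOND = 3300.0

print(f"{per_token_latency_us(PEAK_TOKENS_PER_SECOND):.1f} us per token")
print(f"{generation_time_s(1000, PEAK_TOKENS_PER_SECOND):.3f} s for 1,000 tokens")
```

At that rate each decode step has a budget of roughly 303 microseconds, and a 1,000-token response takes about 0.3 seconds.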
Infrastructure · Code generation · Benchmarks

Summary generated by Claude — human-verified