Reddit r/LocalLLaMA

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

Signal: 78 · Hype: 25
In three lines: A fused MoE dispatch kernel written in pure Triton (no CUDA) achieves 89-131% of Megablocks performance on A100. It fuses the gate and up projections to cut memory traffic by 35%. It runs on AMD MI300X with zero code changes. Limitations: performance degrades beyond 2048 tokens and with 64 or more experts.
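
To make the gate+up fusion concrete, here is a minimal pure-Python sketch. It is not the post's Triton kernel. It assumes a SwiGLU-style expert FFN (out = silu(x·W_gate) * (x·W_up)), which the summary does not confirm, and the function names are hypothetical. The point is only that the fused version reads each input element once for both projections and never stores the intermediate gate and up vectors.

```python
import math

def silu(v):
    # SiLU activation: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(x, w):
    # x: list of d_in floats; w: d_in x d_out nested list
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def ffn_unfused(x, w_gate, w_up):
    # Two separate projections: x is traversed twice and both
    # intermediate vectors are materialized before being combined.
    gate = matvec(x, w_gate)
    up = matvec(x, w_up)
    return [silu(g) * u for g, u in zip(gate, up)]

def ffn_fused(x, w_gate, w_up):
    # Assumed fusion pattern: one loop accumulates both projections,
    # so each x[i] is loaded once and gate/up stay in local variables.
    out = []
    for j in range(len(w_gate[0])):
        g = 0.0
        u = 0.0
        for i, xi in enumerate(x):
            g += xi * w_gate[i][j]
            u += xi * w_up[i][j]
        # Activation and product applied before anything is written out.
        out.append(silu(g) * u)
    return out
```

Both functions compute the same values; in a real GPU kernel the saving would come from fewer global-memory reads and writes, which this sketch only mirrors structurally.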
Open source · Infrastructure · Code generation · Benchmarks

Summary generated by Claude — human-verified