Reddit r/LocalLLaMA · 27 May 2026

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

In three lines

A fused MoE dispatch kernel written in pure Triton (no CUDA) reaches 89-131% of Megablocks performance on an A100. It fuses the gate and up projections to cut memory traffic by 35%. It runs on an AMD MI300X with zero code changes. Limitations: performance degrades beyond 2048 tokens and with 64 or more experts.
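To make the gate+up fusion concrete, here is a minimal pure-Python sketch of the idea, not the kernel from the post. It assumes a SwiGLU-style expert (SiLU on the gate branch, elementwise product with the up branch), which the summary does not state. The helper names `matvec`, `swiglu_unfused`, and `swiglu_fused` are hypothetical. The unfused path walks the input twice, once per projection, while the fused path concatenates the two weight matrices and walks it once.

```python
import math


def silu(v):
    # SiLU / swish activation: v * sigmoid(v). Assumed, not confirmed by the post.
    return v / (1.0 + math.exp(-v))


def matvec(x, w):
    # x: list of d floats; w: d x n matrix as a list of rows. Returns n floats.
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_unfused(x, w_gate, w_up):
    # Two separate projections: the input x is traversed twice.
    gate = matvec(x, w_gate)
    up = matvec(x, w_up)
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_fused(x, w_gate, w_up):
    # Concatenate the weights column-wise into one d x 2n matrix so a single
    # projection produces both halves; the input x is traversed once.
    n = len(w_gate[0])
    w_cat = [row_g + row_u for row_g, row_u in zip(w_gate, w_up)]
    both = matvec(x, w_cat)
    return [silu(g) * u for g, u in zip(both[:n], both[n:])]
```

Both functions return the same values for the same inputs; only the number of passes over `x` differs. In a real GPU kernel the saving would come from loading each input tile from memory once rather than twice, and the 35% figure above is the post's own measurement, not something this sketch reproduces.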

Open source · Infrastructure · Code generation · Benchmarks

Summary generated by Claude — human-verified
