Reddit r/MachineLearning · 27 May 2026

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]

In three lines

TritonMoE is a pure-Triton MoE kernel for portable inference on NVIDIA and AMD GPUs, with no vendor-specific code. A fused gate+up GEMM reduces memory traffic by 35%. On A100 it reaches 89-131% of Megablocks throughput at batch sizes of ≤512 tokens, and the same kernel runs on MI300X. Limitations: performance degrades at 2048+ tokens and with 64+ experts.
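For readers unfamiliar with the "fused gate+up GEMM" idea, here is a minimal pure-Python sketch of the arithmetic only. It is not the TritonMoE kernel. It assumes a SwiGLU-style expert MLP (the summary does not say which activation is used), and the names `matmul`, `silu`, `unfused`, `fused`, `w_gate`, and `w_up` are hypothetical. The point is that concatenating the gate and up weights lets one GEMM read the activations once instead of twice while producing the same result.

```python
import math

def matmul(a, b):
    """Naive matrix multiply: (m x k) @ (k x n) -> (m x n), lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def silu(v):
    """SiLU activation, assumed here; the source does not name the activation."""
    return v / (1.0 + math.exp(-v))

def unfused(x, w_gate, w_up):
    # Two separate GEMMs: the activations x are read from memory twice.
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]

def fused(x, w_gate, w_up):
    # One GEMM against [w_gate | w_up], concatenated along the output dimension,
    # so x is read once. The gating is applied to the two halves of each row.
    w_cat = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]
    both = matmul(x, w_cat)
    hidden = len(w_gate[0])
    return [[silu(row[j]) * row[j + hidden] for j in range(hidden)]
            for row in both]

# Toy inputs: 2 tokens, model dim 2, hidden dim 3 (illustrative values only).
x = [[1.0, 2.0], [0.5, -1.0]]
w_gate = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_up = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

out_unfused = unfused(x, w_gate, w_up)
out_fused = fused(x, w_gate, w_up)
```

In a real GPU kernel the saving comes from loading each tile of `x` once and keeping the gate and up partial results in registers; the sketch above only demonstrates that the two formulations are numerically equivalent.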

Benchmarks Open source

Summary generated by Claude — human-verified
