Reddit r/MachineLearning

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA [R]

Signal: 78 · Hype: 25
In three lines
TritonMoE is a pure-Triton MoE dispatch kernel for portable inference on NVIDIA and AMD GPUs, with no vendor-specific code. Fusing the gate and up projections into a single GEMM reduces memory traffic by 35%. On an A100 it reaches 89-131% of MegaBlocks throughput at batches of ≤512 tokens, and the same kernel runs on MI300X. Limitations: performance degrades at 2048+ tokens and with 64+ experts.
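
To make the "fused gate+up GEMM" idea concrete, here is a minimal pure-Python sketch, not the TritonMoE kernel itself. It assumes a SwiGLU-style expert (SiLU(gate) * up), which the summary does not state, and all names (`swiglu_fused`, `w_gate_up`, the toy sizes) are hypothetical. It only shows that one matrix multiply against the column-concatenated weight gives the same result as two separate ones, so the activations need to be read once instead of twice; it does not reproduce the reported 35% figure.

```python
import math
import random


def matmul(a, b):
    """Naive (m x k) @ (k x n) multiply on lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_unfused(x, w_gate, w_up):
    """Two separate GEMMs: the input x is consumed once per projection."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def swiglu_fused(x, w_gate_up, hidden):
    """One GEMM against [w_gate | w_up], then split the output columns."""
    both = matmul(x, w_gate_up)
    return [[silu(g) * u for g, u in zip(row[:hidden], row[hidden:])]
            for row in both]


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
tokens, d_model, hidden = 4, 8, 6  # toy sizes, chosen only for illustration

x = rand_matrix(tokens, d_model)
w_gate = rand_matrix(d_model, hidden)
w_up = rand_matrix(d_model, hidden)

# Concatenate the two weight matrices along the column (output) dimension.
w_gate_up = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]

ref = swiglu_unfused(x, w_gate, w_up)
out = swiglu_fused(x, w_gate_up, hidden)

max_diff = max(abs(a - b)
               for r_row, o_row in zip(ref, out)
               for a, b in zip(r_row, o_row))
print("max abs difference:", max_diff)
```

In a real GPU kernel the saving would come from loading each input tile once and keeping the intermediate gate/up values on chip; this sketch only checks the numerical equivalence that makes that fusion legal.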
Benchmarks · Open source

Summary generated by Claude — human-verified