Thanks for reading the post and GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as for low-latency inference: training generally involves much larger kernels, so per-kernel launch overhead is a much smaller fraction of total runtime.
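To make that concrete, here is a back-of-the-envelope timing sketch (illustrative only; the shapes and helper are made up, not from our codebase). Launch overhead is a roughly fixed cost of a few microseconds per kernel, so the question is how that compares to kernel runtime:

```python
# Illustrative comparison: kernel launch overhead (~5-10 us, roughly fixed)
# matters for small decode-style kernels but is amortized away by the large
# kernels typical of training. Assumes PyTorch and a CUDA-capable GPU.
import torch

def avg_time_us(fn, iters=200):
    # Use CUDA events so we measure device time, not Python overhead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                      # warm-up (lazy init, autotuning)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000 / iters   # ms total -> us per call

x_small = torch.randn(1, 4096, device="cuda")       # decode-style: batch 1
w_small = torch.randn(4096, 4096, device="cuda")
x_big = torch.randn(8192, 8192, device="cuda")      # training-style GEMM
w_big = torch.randn(8192, 8192, device="cuda")

t_small = avg_time_us(lambda: x_small @ w_small)    # often tens of us
t_big = avg_time_us(lambda: x_big @ w_big)          # typically milliseconds
print(f"small: {t_small:.1f} us/call  large: {t_big:.1f} us/call")
```

A ~5-10 us launch cost is a large slice of the small call but noise on the large one, which is why eliminating launches pays off most for low-latency decoding.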
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
I was wondering whether these same optimizations could be brought to bear on training as well, rather than only inference. I'd guess the challenge there is fusing backward computation with gradient communication.
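For context on what fusion would have to beat: the standard unfused approach is to overlap communication with compute by launching an async all-reduce from each parameter's gradient hook during backward. A minimal PyTorch sketch (glossing over the bucketing and stream handling a real implementation like DDP needs):

```python
# Sketch of the usual *unfused* overlap: start an async all-reduce for each
# gradient as soon as autograd produces it, so the NCCL communication runs
# concurrently with the remaining backward computation.
# Assumes torch.distributed is already initialized (hypothetical setup).
import torch
import torch.distributed as dist

def overlap_grad_allreduce(model: torch.nn.Module):
    handles = []

    def make_hook():
        def hook(grad):
            # async_op=True returns a handle immediately; the collective
            # overlaps with the rest of the backward pass.
            handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                           async_op=True))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())

    def wait_all():
        # Call after loss.backward(), before the optimizer step.
        for h in handles:
            h.wait()
        handles.clear()

    return wait_all

# Usage: wait = overlap_grad_allreduce(model); loss.backward(); wait()
```

True fusion would go further than this: the communication would live in the same kernel as the backward math instead of merely overlapping on separate streams.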
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across a paper that tackles exactly that:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667