Thanks for reading the post and the GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as for low-latency inference, since training generally involves much larger kernels, which makes kernel launch overhead less of a bottleneck.
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!