Thanks for reading the post and GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as for low-latency inference: training generally involves much larger kernels, so per-kernel launch overhead is a much smaller fraction of total runtime.
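To make that concrete, here is a back-of-the-envelope timing sketch (illustrative only; the shapes and helper are made up, not from our codebase). Launch overhead is a roughly fixed cost of a few microseconds per kernel, so the question is how that compares to kernel runtime:

```python
# Illustrative comparison: kernel launch overhead (~5-10 us, roughly fixed)
# matters for small decode-style kernels but is amortized away by the large
# kernels typical of training. Assumes PyTorch and a CUDA-capable GPU.
import torch

def avg_time_us(fn, iters=200):
    # Use CUDA events so we measure device time, not Python overhead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                      # warm-up (lazy init, autotuning)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000 / iters   # ms total -> us per call

x_small = torch.randn(1, 4096, device="cuda")       # decode-style: batch 1
w_small = torch.randn(4096, 4096, device="cuda")
x_big = torch.randn(8192, 8192, device="cuda")      # training-style GEMM
w_big = torch.randn(8192, 8192, device="cuda")

t_small = avg_time_us(lambda: x_small @ w_small)    # often tens of us
t_big = avg_time_us(lambda: x_big @ w_big)          # typically milliseconds
print(f"small: {t_small:.1f} us/call  large: {t_big:.1f} us/call")
```

A ~5-10 us launch cost is a large slice of the small call but noise on the large one, which is why eliminating launches pays off most for low-latency decoding.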
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
I was wondering whether these same optimizations could be brought to bear on training as well, rather than only inference. I'd guess the challenge there is fusing backward computation with gradient communication.
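For context on what fusion would have to beat: the standard unfused approach is to overlap communication with compute by launching an async all-reduce from each parameter's gradient hook during backward. A minimal PyTorch sketch (glossing over the bucketing and stream handling a real implementation like DDP needs):

```python
# Sketch of the usual *unfused* overlap: start an async all-reduce for each
# gradient as soon as autograd produces it, so the NCCL communication runs
# concurrently with the remaining backward computation.
# Assumes torch.distributed is already initialized (hypothetical setup).
import torch
import torch.distributed as dist

def overlap_grad_allreduce(model: torch.nn.Module):
    handles = []

    def make_hook():
        def hook(grad):
            # async_op=True returns a handle immediately; the collective
            # overlaps with the rest of the backward pass.
            handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                           async_op=True))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())

    def wait_all():
        # Call after loss.backward(), before the optimizer step.
        for h in handles:
            h.wait()
        handles.clear()

    return wait_all

# Usage: wait = overlap_grad_allreduce(model); loss.backward(); wait()
```

True fusion would go further than this: the communication would live in the same kernel as the backward math instead of merely overlapping on separate streams.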
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across a paper that tackles exactly that:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667