It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.
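A rough back-of-envelope of that intuition (all numbers made up: assume a fixed ~0.5ms of kernel-launch/sync overhead per forward pass that the megakernel eliminates, and compute time scaling linearly with params):

    # Illustrative only: fixed per-pass overhead vs. compute that grows with model size
    overhead_ms = 0.5  # assumed launch/sync overhead the megakernel removes
    compute_ms = {"1B": 1.0, "14B": 14.0, "70B": 70.0}  # assumed compute-only times

    for size, c in compute_ms.items():
        baseline = c + overhead_ms
        print(f"{size}: {baseline:.1f}ms -> {c:.1f}ms ({baseline / c:.2f}x)")

which gives ~1.5x at 1B but only ~1.01x at 70B.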
And of course the effect on throughput at larger batch sizes, which they allude to at the end.
Overall a very interesting result!
ptrj_
This could also give a nice speedup for MoE models with 7B-70B total parameters but ~10x fewer active params, e.g. https://huggingface.co/Qwen/Qwen3-30B-A3B, assuming the expert router can be effectively scheduled within the monolithic mega-kernel.
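Rough arithmetic on why (assumed numbers, not measured): per-token decode compute tracks active params, so an A3B model's forward pass looks more like a 3B dense model's, where a fixed launch overhead is a proportionally bigger slice:

    # Illustrative only: MoE decode compute scales with active params, not total
    active_params_b = 3.0   # Qwen3-30B-A3B: ~3B active per token out of 30B total
    ms_per_active_b = 1.0   # assumed compute time per 1B active params
    overhead_ms = 0.5       # assumed fixed launch/sync overhead

    compute_ms = active_params_b * ms_per_active_b
    baseline_ms = compute_ms + overhead_ms
    print(f"~{baseline_ms:.1f}ms -> ~{compute_ms:.1f}ms ({baseline_ms / compute_ms:.2f}x)")

so the overhead fraction sits closer to the small-model case, modulo the router-scheduling caveat.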
mmoskal
They are reducing the forward pass time from, say, 1.5ms to 1ms. On a bigger model you would likely go from 15ms to 14.2ms or something like that.