Comment by touisteur - Hacker Neue

touisteur 5 days ago parent

It's not so much kernel overhead than memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with fused kernel approach. Kernel launch overhead can be drastically reduced with cuda-graph indeed.

I'm not sure it applies so well in LLMs though (should read the paper...).

This item has no comments currently.