With a fused-kernel approach, what you (or at least I) target is not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory. Launch overhead can indeed be drastically reduced with CUDA Graphs.
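To put a rough number on the memory-traffic point, here is a back-of-the-envelope sketch. The workload (`y = relu(a * x + b)` over float32 with scalar `a` and `b`) and the helper names are my own assumptions for illustration, and it ignores caching entirely, so treat it as an upper-bound intuition rather than a measurement.

```python
# Rough global-memory traffic estimate for y = relu(a * x + b) on N float32
# elements, with scalar a and b. Hypothetical workload; caches are ignored.

BYTES_PER_ELEM = 4  # float32


def unfused_bytes(n, num_ops=3):
    # Unfused: mul, add and relu are separate kernels. Each one reads an
    # N-element array from global memory and writes an N-element result back.
    return num_ops * 2 * n * BYTES_PER_ELEM


def fused_bytes(n):
    # Fused: a single kernel reads x once, keeps the intermediates in
    # registers, and writes y once.
    return 2 * n * BYTES_PER_ELEM


n = 1 << 24  # about 16.8M elements
print(unfused_bytes(n) // 2**20, "MiB unfused")
print(fused_bytes(n) // 2**20, "MiB fused")
print(unfused_bytes(n) / fused_bytes(n), "x less traffic when fused")
```

In this simplified model the saving scales with the number of elementwise ops you fuse, and no amount of launch-overhead reduction changes it, since the bytes still have to move.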
I'm not sure it applies so well in LLMs though (should read the paper...).