Fair enough. I should have clarified that “approximately the cost of a single kernel launch” is pretty much what I meant by “tiny”.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
It's not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.
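To make the distinction concrete, here's a minimal sketch of the CUDA graph path: the launch sequence is captured once and then replayed, which amortizes per-kernel launch overhead but does nothing about the intermediate round-trip through global memory between `k1` and `k2` (the kernels here are illustrative stand-ins, and this uses the CUDA 12 `cudaGraphInstantiate` signature; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Two back-to-back elementwise kernels: the output of k1 goes to global
// memory and is re-read by k2 -- exactly the traffic a fused kernel avoids.
__global__ void k1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void k2(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture the launch sequence once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    k1<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    k2<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

    // ...then replay the whole sequence with roughly one launch's overhead.
    cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```

The graph replay cuts CPU-side launch cost; only fusing `k1` and `k2` into one kernel would keep the intermediate values in registers or shared memory instead of spilling them to DRAM.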
I'm not sure how well that applies to LLMs, though (I should read the paper...).
Absolutely not; it’s comparable to the launch overhead of a kernel.