saagarjha
> The CPU launch cost of a graph is tiny

Absolutely not; it’s comparable to the launch overhead of a kernel.


Fair enough. I should have clarified that “approximately the cost of a single kernel launch” is pretty much what I meant by “tiny”.

What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.

touisteur
It's not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.

I'm not sure it applies so well in LLMs though (should read the paper...).
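
To put rough numbers on the memory-traffic point, here is a toy Python model (not a measurement). It assumes a chain of elementwise ops where every unfused kernel reads its input from global memory and writes its output back, while a fused kernel keeps intermediates on-chip. The function name and the workload size are hypothetical, and caching effects are ignored.

```python
def global_memory_traffic_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Bytes moved to/from global memory for a chain of elementwise ops.

    Toy model (assumption): each unfused kernel does one full read and one
    full write of the tensor; a fused kernel reads the input once, keeps
    intermediates in registers/shared memory, and writes the result once.
    """
    tensor_bytes = num_elements * bytes_per_element
    kernels = 1 if fused else num_ops
    return kernels * 2 * tensor_bytes  # one read + one write per kernel


# Hypothetical workload: 3 chained elementwise ops over 1M fp16 values.
unfused = global_memory_traffic_bytes(1 << 20, 2, 3, fused=False)
fused = global_memory_traffic_bytes(1 << 20, 2, 3, fused=True)
print(unfused, fused, unfused / fused)
```

In this model, capturing the three unfused kernels into a graph would cut the number of CPU-side launches but leave the traffic figure unchanged, which is the distinction being drawn here.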

zhihaojia
You are right that CUDA graphs can help reduce launch overhead, but they do not support overlapping computation and communication across layers, since data dependencies are described at the kernel level.
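
A toy Python sketch of why dependency granularity matters for overlap. It is an illustrative model only, not how CUDA graphs or any particular runtime schedule work. The assumptions are a single compute unit processing tiles one after another, a single communication link, and fixed per-tile times. Both function names are hypothetical.

```python
def kernel_level_makespan(num_tiles, compute_time, comm_time):
    """Communication kernel waits for the whole compute kernel to finish."""
    return num_tiles * compute_time + num_tiles * comm_time


def tile_level_makespan(num_tiles, compute_time, comm_time):
    """Each tile's communication may start as soon as that tile is computed."""
    comm_free_at = 0.0  # time at which the link finishes its current transfer
    for tile in range(num_tiles):
        tile_ready_at = (tile + 1) * compute_time  # tiles computed back to back
        comm_free_at = max(tile_ready_at, comm_free_at) + comm_time
    return comm_free_at


# Hypothetical numbers: 4 tiles, 1 time unit each for compute and for comm.
print(kernel_level_makespan(4, 1.0, 1.0), tile_level_makespan(4, 1.0, 1.0))
```

With dependencies tracked per kernel, the two phases run strictly one after the other. With dependencies tracked per tile, most of the communication hides behind the remaining compute.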
