Fair enough. I should have clarified that “approximately the cost of a single kernel launch” is pretty much what I meant by “tiny”.
What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.
It's not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA graphs.
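To make the distinction concrete, here's a minimal sketch of the CUDA graph path: the launch sequence is captured once and then replayed, which amortizes per-kernel launch overhead but does nothing about the intermediate round-trip through global memory between `k1` and `k2` (the kernels here are illustrative stand-ins, and this uses the CUDA 12 `cudaGraphInstantiate` signature; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Two back-to-back elementwise kernels: the output of k1 goes to global
// memory and is re-read by k2 -- exactly the traffic a fused kernel avoids.
__global__ void k1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void k2(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture the launch sequence once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    k1<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    k2<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

    // ...then replay the whole sequence with roughly one launch's overhead.
    cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```

The graph replay cuts CPU-side launch cost; only fusing `k1` and `k2` into one kernel would keep the intermediate values in registers or shared memory instead of spilling them to DRAM.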
I'm not sure how well that applies to LLMs, though (I should read the paper...).
Absolutely not; it’s comparable to the launch overhead of a kernel.