
refulgentis
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished: the dependency is tracked at kernel granularity. With a megakernel, those ops are broken into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as a few SMs have computed that slice, while the other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the unit of dependency is the entire kernel.
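To make the scheduling difference concrete, here is a deterministic toy model (all names are illustrative, not from any real framework): matmul slices run serially on one "SM" resource and reduce chunks run on a "network" resource, and we compare the makespan when reduces depend on the whole matmul (kernel-level) versus only their own slice (slice-level).

```python
# Toy list-scheduler contrasting kernel-level vs slice-level dependencies.
# One "sm" resource runs matmul slices; one "net" resource runs reduces.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    res: str                  # which resource this task occupies
    dur: int                  # duration in abstract time units
    deps: list = field(default_factory=list)

def makespan(tasks):
    """Greedy schedule: a task starts once its deps and its resource are free."""
    finish, free = {}, {}
    pending = list(tasks)
    while pending:
        progressed = False
        for t in list(pending):
            if all(d in finish for d in t.deps):
                start = max(free.get(t.res, 0),
                            max((finish[d] for d in t.deps), default=0))
                free[t.res] = finish[t.name] = start + t.dur
                pending.remove(t)
                progressed = True
        assert progressed, "dependency cycle"
    return max(finish.values())

def kernel_level_tasks(n=4):
    """Each reduce waits for ALL matmul slices (CUDA Graph-style edge)."""
    ms = [Task(f"m{i}", "sm", 1) for i in range(n)]
    rs = [Task(f"r{i}", "net", 1, [m.name for m in ms]) for i in range(n)]
    return ms + rs

def slice_level_tasks(n=4):
    """Each reduce waits only for its own slice (megakernel-style edge)."""
    ms = [Task(f"m{i}", "sm", 1) for i in range(n)]
    rs = [Task(f"r{i}", "net", 1, [f"m{i}"]) for i in range(n)]
    return ms + rs
```

With 4 slices, the kernel-level graph finishes at t=8 (all communication serialized after all compute), while the slice-level graph finishes at t=5, because each reduce overlaps with the remaining matmul slices.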


Ah, that makes a lot of sense. Is this fine-grained task scheduling related to CUDA Dynamic Parallelism at all? If not, would you have a pointer on where to look?

I suppose I could look through the code of this project, but I’d hate to have to untangle it from the compiler infrastructure.

touisteur
Think of it more as 'tasking by hand': you have one kernel driving the 18k+ cores, and you manually (or via device libraries) synchronize them at fine granularity, handle memory traffic asynchronously, and pipeline as much as you can.
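The core of that 'tasking by hand' structure is the persistent-threads pattern: a fixed pool of workers loops over a shared task counter instead of launching one kernel per op. Here is a minimal CPU sketch (names like `run_persistent` are made up for illustration; on a GPU the lock would be an `atomicAdd` and the workers would be thread blocks):

```python
import threading

def run_persistent(tasks, num_workers=4):
    """Fixed worker pool draining a shared task queue, persistent-kernel style."""
    next_task = [0]                    # shared task counter
    lock = threading.Lock()            # stands in for atomicAdd on the GPU
    results = [None] * len(tasks)

    def worker():
        while True:
            with lock:                 # atomically claim the next task id
                i = next_task[0]
                next_task[0] += 1
            if i >= len(tasks):
                return                 # queue drained: this worker exits
            results[i] = tasks[i]()    # run the task body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The key property is that the workers are launched once and never return to the host between tasks, which is what lets a megakernel keep dependencies and pipelining entirely in device-side software.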

You might have a look at cooperative groups, as well as things like cuda::pipeline in libcudacxx for asynchronous, pipelined memory traffic, plus most of the block/warp CUB primitives, and then move up to cuFFTDx, cuBLASDx, and now cuSolverDx as the starting toolbox for your fused-kernel journey.
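The idea behind cuda::pipeline-style staging can be sketched without the GPU: double-buffer the input so the next tile is staged into the alternate buffer while the current tile is being computed on. This is a hypothetical single-threaded analogy (on a GPU the stage step would be an asynchronous global-to-shared copy that genuinely overlaps the compute):

```python
def process_tiled(data, tile=4):
    """Sum `data` tile by tile using two buffers, double-buffering style."""
    bufs = [None, None]
    ntiles = (len(data) + tile - 1) // tile
    bufs[0] = data[0:tile]              # prologue: stage the first tile
    total = 0
    for i in range(ntiles):
        nxt = i + 1
        if nxt < ntiles:                # stage tile i+1 into the other buffer
            bufs[nxt % 2] = data[nxt * tile:(nxt + 1) * tile]
        total += sum(bufs[i % 2])       # compute on tile i
    return total
```

On the device, cuda::pipeline provides the producer/consumer synchronization that makes the stage and compute steps safely concurrent; the ping-pong buffer indexing is the part this sketch preserves.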
