
refulgentis
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished: the dependency is tracked at kernel granularity. With a megakernel, those ops are broken into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as a few SMs have computed that slice, while the other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the unit of dependency is the entire kernel.
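To make the scheduling difference concrete, here is a deterministic toy model (all names are illustrative, not from any real framework): matmul slices run serially on one "SM" resource and reduce chunks run on a "network" resource, and we compare the makespan when reduces depend on the whole matmul (kernel-level) versus only their own slice (slice-level).

```python
# Toy list-scheduler contrasting kernel-level vs slice-level dependencies.
# One "sm" resource runs matmul slices; one "net" resource runs reduces.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    res: str                  # which resource this task occupies
    dur: int                  # duration in abstract time units
    deps: list = field(default_factory=list)

def makespan(tasks):
    """Greedy schedule: a task starts once its deps and its resource are free."""
    finish, free = {}, {}
    pending = list(tasks)
    while pending:
        progressed = False
        for t in list(pending):
            if all(d in finish for d in t.deps):
                start = max(free.get(t.res, 0),
                            max((finish[d] for d in t.deps), default=0))
                free[t.res] = finish[t.name] = start + t.dur
                pending.remove(t)
                progressed = True
        assert progressed, "dependency cycle"
    return max(finish.values())

def kernel_level_tasks(n=4):
    """Each reduce waits for ALL matmul slices (CUDA Graph-style edge)."""
    ms = [Task(f"m{i}", "sm", 1) for i in range(n)]
    rs = [Task(f"r{i}", "net", 1, [m.name for m in ms]) for i in range(n)]
    return ms + rs

def slice_level_tasks(n=4):
    """Each reduce waits only for its own slice (megakernel-style edge)."""
    ms = [Task(f"m{i}", "sm", 1) for i in range(n)]
    rs = [Task(f"r{i}", "net", 1, [f"m{i}"]) for i in range(n)]
    return ms + rs
```

With 4 slices, the kernel-level graph finishes at t=8 (all communication serialized after all compute), while the slice-level graph finishes at t=5, because each reduce overlaps with the remaining matmul slices.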


Ah, that makes a lot of sense. Is this fine-grained task scheduling related to CUDA Dynamic Parallelism at all? If not, would you have a pointer on where to look?

I suppose I could look through the code of this project, but I’d hate to have to untangle it from the compiler infrastructure.

touisteur
Think of it more as 'tasking by hand': you have one kernel driving the 18k+ cores, and you manually (or via device libraries) synchronize them at fine granularity, handle memory traffic asynchronously, and pipeline as much as you can.
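The core of that 'tasking by hand' structure is the persistent-threads pattern: a fixed pool of workers loops over a shared task counter instead of launching one kernel per op. Here is a minimal CPU sketch (names like `run_persistent` are made up for illustration; on a GPU the lock would be an `atomicAdd` and the workers would be thread blocks):

```python
import threading

def run_persistent(tasks, num_workers=4):
    """Fixed worker pool draining a shared task queue, persistent-kernel style."""
    next_task = [0]                    # shared task counter
    lock = threading.Lock()            # stands in for atomicAdd on the GPU
    results = [None] * len(tasks)

    def worker():
        while True:
            with lock:                 # atomically claim the next task id
                i = next_task[0]
                next_task[0] += 1
            if i >= len(tasks):
                return                 # queue drained: this worker exits
            results[i] = tasks[i]()    # run the task body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The key property is that the workers are launched once and never return to the host between tasks, which is what lets a megakernel keep dependencies and pipelining entirely in device-side software.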

You might have a look at cooperative groups, as well as things like cuda::pipeline in libcudacxx for asynchronous, pipelined memory traffic, plus most of the block/warp CUB primitives, and then move up to cuFFTDx, cuBLASDx, and now cuSolverDx as the starting toolbox for your fused-kernel journey.
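The idea behind cuda::pipeline-style staging can be sketched without the GPU: double-buffer the input so the next tile is staged into the alternate buffer while the current tile is being computed on. This is a hypothetical single-threaded analogy (on a GPU the stage step would be an asynchronous global-to-shared copy that genuinely overlaps the compute):

```python
def process_tiled(data, tile=4):
    """Sum `data` tile by tile using two buffers, double-buffering style."""
    bufs = [None, None]
    ntiles = (len(data) + tile - 1) // tile
    bufs[0] = data[0:tile]              # prologue: stage the first tile
    total = 0
    for i in range(ntiles):
        nxt = i + 1
        if nxt < ntiles:                # stage tile i+1 into the other buffer
            bufs[nxt % 2] = data[nxt * tile:(nxt + 1) * tile]
        total += sum(bufs[i % 2])       # compute on tile i
    return total
```

On the device, cuda::pipeline provides the producer/consumer synchronization that makes the stage and compute steps safely concurrent; the ping-pong buffer indexing is the part this sketch preserves.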
