Somewhat relevant anecdote: we had a small CUDA competition (10-ish years ago). Some embarrassingly parallel CV algorithm.
I tried to be smart and cache intermediate results that were shared by multiple kernels.
When the results were published I was stunned to see that others were orders of magnitude faster than me.
Turns out they didn't bother with caching at all. The overhead of recalculating everything a thousand times was tiny compared to the overhead of doing roundtrips through RAM.
I assume it's the same thing here. Compiling into MegaKernels squashes the layer boundaries. There will likely be _more_ calculations and fewer shared intermediate results, but overall it's still a win because of fewer memory roundtrips.
There has to be a sweet spot, especially for convolutional networks. No idea if the MegaKernel approach takes this into account.
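
To make the roundtrip point concrete, here's a toy element-wise sketch (hypothetical example, not the actual MegaKernel machinery): the unfused version writes the intermediate tensor out to global memory and reads it back in the next launch, while the fused version keeps it in a register the whole time.

    // Unfused: two launches, intermediate "tmp" roundtrips through global memory.
    __global__ void layer1(const float* x, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = tanhf(x[i]);      // write intermediate to DRAM
    }
    __global__ void layer2(const float* tmp, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = tmp[i] * tmp[i];    // read intermediate back from DRAM
    }

    // Fused: one launch, intermediate lives in a register, never touches DRAM.
    __global__ void fused(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = tanhf(x[i]);            // recomputed per thread, but stays on-chip
            y[i] = t * t;
        }
    }

Even if the fused version has to redo some work that separate kernels could have shared, it usually wins as long as the recomputation is cheaper than the DRAM traffic it replaces.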