Somewhat relevant anecdote: we had a small CUDA competition (10-ish years ago). Some embarrassingly parallel CV algorithm.
I tried to be smart and cache intermediate results that were shared by multiple kernels.
When the results were published I was stunned to see that others were orders of magnitude faster than me.
Turns out they didn't bother with caching at all. The overhead of recalculating everything a thousand times was tiny compared to the overhead of doing roundtrips through RAM.
I assume it's the same thing here. Compiling into MegaKernels squashes the layer boundaries. There will likely be _more_ calculations and fewer shared intermediate results, but overall it's still a win because of fewer memory roundtrips.
There has to be a sweet spot, especially for convolutional networks. No idea if the MegaKernel approach takes this into account.
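
To make the roundtrip point concrete, here's a toy element-wise sketch (hypothetical example, not the actual MegaKernel machinery): the unfused version writes the intermediate tensor out to global memory and reads it back in the next launch, while the fused version keeps it in a register the whole time.

    // Unfused: two launches, intermediate "tmp" roundtrips through global memory.
    __global__ void layer1(const float* x, float* tmp, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = tanhf(x[i]);      // write intermediate to DRAM
    }
    __global__ void layer2(const float* tmp, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = tmp[i] * tmp[i];    // read intermediate back from DRAM
    }

    // Fused: one launch, intermediate lives in a register, never touches DRAM.
    __global__ void fused(const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t = tanhf(x[i]);            // recomputed per thread, but stays on-chip
            y[i] = t * t;
        }
    }

Even if the fused version has to redo some work that separate kernels could have shared, it usually wins as long as the recomputation is cheaper than the DRAM traffic it replaces.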