
touisteur
It's not so much kernel launch overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can indeed be drastically reduced with CUDA Graphs.

I'm not sure it applies as well to LLMs, though (I should read the paper...).
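
To put rough numbers on the memory-traffic point, here is a back-of-envelope sketch in Python. It assumes a chain of purely elementwise ops where every unfused kernel reads its input from global memory and writes its output back, while the fused kernel keeps intermediates in registers. The function name and parameters are made up for illustration, and real kernels (caching, reductions, tiling) will not follow this model exactly.

```python
def global_memory_traffic_bytes(n_elements, n_ops, bytes_per_element=4, fused=False):
    """Estimate bytes moved to/from global memory by a chain of elementwise ops.

    Simplified model (an assumption, not a measurement):
    - unfused: each op is its own kernel, so it reads its input from global
      memory and writes its output back (one read + one write per op);
    - fused: a single kernel reads the input once, keeps intermediates in
      registers, and writes only the final result.
    """
    if n_ops < 1:
        raise ValueError("n_ops must be at least 1")
    passes = 1 if fused else n_ops
    return passes * 2 * n_elements * bytes_per_element


n = 1 << 24  # 16M float32 elements
unfused = global_memory_traffic_bytes(n, n_ops=3)
fused = global_memory_traffic_bytes(n, n_ops=3, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
```

Under this model the saving grows linearly with the number of ops you fuse, whereas CUDA Graphs only remove the launch cost and leave the traffic unchanged.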

