
It really is not obvious. These launches are asynchronous, and data movement and computation are overlapped properly through the CUDA APIs. Even the per-kernel launch cost has been reduced since the introduction of CUDA graphs.
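To make the CUDA graphs point concrete, here is a minimal sketch (kernel names and sizes are made up): capture a run of small kernels once, then replay the whole sequence with a single cudaGraphLaunch per step instead of paying the launch cost for every kernel.

    // Capture many tiny launches into one graph; replay with a single launch.
    #include <cuda_runtime.h>

    __global__ void tiny_step(float* x, int n) {   // stand-in for one small decode kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 1.0001f + 1e-6f;
    }

    int main() {
        const int n = 4096;
        float* x;
        cudaMalloc(&x, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture: these launches are recorded into a graph, not executed yet.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int layer = 0; layer < 100; ++layer)  // e.g. 100 small per-layer kernels
            tiny_step<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);     // CUDA 12 signature

        // Replay: one launch call per token instead of 100.
        for (int token = 0; token < 10; ++token)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaFree(x);
        return 0;
    }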

The CUDA programming model relies on each kernel being computationally expensive enough to be worthwhile, and that is not true for LLM token generation. And we are talking about network evaluation at more than 1,000 per second, whereas previously, outside of recommendation systems, the network evaluation rates we looked at were ~100 per second at most.
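Rough arithmetic behind that rate: at 1,000 forward passes per second a pass gets ~1 ms, so if it issues a few hundred small kernels at a few microseconds of fixed launch cost each, the launch overhead alone can consume the whole budget. A toy micro-benchmark (illustrative only; the number varies by GPU and driver) for estimating that per-launch cost:

    // Time N back-to-back launches of an empty kernel to estimate the
    // fixed per-launch overhead; the kernel itself does no work.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void empty_kernel() {}

    int main() {
        const int launches = 10000;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 32>>>();       // warm-up
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < launches; ++i)
            empty_kernel<<<1, 32>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // At a few microseconds per launch, a few hundred kernels per token
        // already approach the ~1 ms budget of a 1000 token/s decode loop.
        printf("avg per-launch cost: %.2f us\n", 1000.0f * ms / launches);
        return 0;
    }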

Also, nobody remembers Alex's "One Weird Trick" paper, which slices the matmul into pieces to overlap device-to-device transfer with computation. That was 10 years ago.
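For reference, a sketch of that slicing idea (not the paper's code; the chunking, names, and row-major layout are mine, and the transfer is shown host-to-device, although the same double-buffering pattern applies to peer-to-peer copies between GPUs): the weight matrix is split into row chunks, and while chunk k is being multiplied on a compute stream, chunk k+1 is copied on a separate stream.

    // Double-buffered, chunked GEMM: overlap weight transfer with compute.
    // Assumes n is divisible by chunk and hostW is pinned (cudaMallocHost)
    // so the async copies can actually overlap with the GEMMs.
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void chunked_gemm(cublasHandle_t handle,
                      const float* hostW,  // [n x k] weights, row-major, pinned host memory
                      const float* devX,   // [k x m] activations, row-major, on device
                      float* devY,         // [n x m] output, row-major, on device
                      int n, int k, int m, int chunk) {
        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);

        float* devW[2];
        cudaEvent_t copied[2], consumed[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&devW[b], (size_t)chunk * k * sizeof(float));
            cudaEventCreate(&copied[b]);
            cudaEventCreate(&consumed[b]);
        }

        const float one = 1.0f, zero = 0.0f;
        int buf = 0;
        // Prefetch the first weight chunk.
        cudaMemcpyAsync(devW[buf], hostW, (size_t)chunk * k * sizeof(float),
                        cudaMemcpyHostToDevice, copy);
        cudaEventRecord(copied[buf], copy);

        for (int row = 0; row < n; row += chunk) {
            int next = buf ^ 1;
            if (row + chunk < n) {
                // Don't overwrite the other buffer until its last GEMM is done,
                // then copy the next chunk while the current one is computing.
                cudaStreamWaitEvent(copy, consumed[next], 0);
                cudaMemcpyAsync(devW[next], hostW + (size_t)(row + chunk) * k,
                                (size_t)chunk * k * sizeof(float),
                                cudaMemcpyHostToDevice, copy);
                cudaEventRecord(copied[next], copy);
            }
            // The GEMM waits only for its own chunk's transfer.
            cudaStreamWaitEvent(compute, copied[buf], 0);
            cublasSetStream(handle, compute);
            // Row-major Y[row:row+chunk] = W_chunk * X, phrased as a column-major GEMM.
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        m, chunk, k, &one,
                        devX, m, devW[buf], k, &zero,
                        devY + (size_t)row * m, m);
            cudaEventRecord(consumed[buf], compute);
            buf = next;
        }
        cudaStreamSynchronize(compute);

        for (int b = 0; b < 2; ++b) {
            cudaFree(devW[b]);
            cudaEventDestroy(copied[b]);
            cudaEventDestroy(consumed[b]);
        }
        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
    }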


gdiamos
It's surprising to me that the field is willing to invest this much in mega-kernels, but not in models that generate multiple tokens in parallel...
liuliu OP
It is hard to justify a tens-of-millions investment in training just to make the model faster without any idea how it will score on benchmarks. It is easier to justify keeping the model intact and spending extra millions to make it faster through exotic means (megakernels).

There has been some niche research on parallel token generation lately, though...
