It's surprising to me that the field is willing to invest this much in mega-kernels, but not in models that generate multiple tokens in parallel...
It is hard to justify a tens-of-millions-of-dollars training investment just to make the model faster, with no idea how it will score on benchmarks. It is easier to justify keeping the model intact and spending extra millions to make it faster by exotic means (megakernels).
There has been some niche research on parallel token generation of late, though...
The CUDA programming model relies on each kernel being computationally expensive enough to amortize its launch overhead, and that is not true for LLM token generation. And we are talking about network evaluations at more than 1,000 per second, whereas previously, outside of recommendation systems, the evaluation rates we looked at were ~100 per second at most.
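A minimal sketch of why this bites: one decode step is many small kernels, and at small layer sizes the CPU-side launch cost (roughly 5-10 us per launch) is comparable to or larger than the kernel's own runtime. The sizes here (`n`, `layers`) are illustrative, not from any particular model:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately tiny matrix-vector kernel standing in for one layer's work.
__global__ void tiny_gemv(const float* W, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.f;
        for (int j = 0; j < n; ++j) acc += W[i * n + j] * x[j];
        y[i] = acc;
    }
}

int main() {
    const int n = 128;       // hypothetical small layer width
    const int layers = 100;  // e.g., one decode step across 100 layers
    float *W, *x, *y;        // buffers left uninitialized; we only time launches
    cudaMalloc(&W, n * n * sizeof(float));
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // One launch per layer: each kernel finishes in ~1 us at this size,
    // so per-launch overhead dominates the measured wall time.
    for (int l = 0; l < layers; ++l)
        tiny_gemv<<<(n + 127) / 128, 128>>>(W, x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms (%.1f us/launch)\n",
           layers, ms, 1000.f * ms / layers);
    return 0;
}
```

Fusing those hundred launches into one megakernel is exactly how you make that overhead disappear, which is the appeal.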
Also, nobody remembers Alex Krizhevsky's "One Weird Trick" paper, which slices the matmul into pieces to overlap device-to-device transfer with computation. That was 10 years ago.
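Roughly, the idea is pipelining: chunk the input, and while chunk c computes, chunk c+1's transfer is in flight on another stream, so the copy engine and the SMs run concurrently. This is a simplified single-device illustration, not the paper's exact multi-GPU scheme, and the names (`CHUNKS`, `matmul_chunk`, `pipelined_matmul`) are made up for the sketch:

```cuda
#include <cuda_runtime.h>

#define CHUNKS 4

// Naive matmul over one horizontal slice of A: each thread owns one row of C.
__global__ void matmul_chunk(const float* A, const float* B, float* C,
                             int rows, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

// Enqueue each chunk's copy and compute on its own stream; within a stream
// the kernel waits for its copy, across streams transfers overlap compute.
void pipelined_matmul(const float* A_remote, float* A_local,
                      const float* B, float* C, int n) {
    cudaStream_t streams[CHUNKS];
    int rows = n / CHUNKS;  // assume n divisible by CHUNKS for brevity
    for (int c = 0; c < CHUNKS; ++c) {
        cudaStreamCreate(&streams[c]);
        size_t off = (size_t)c * rows * n;
        // async device-to-device copy of this slice (e.g., from a peer GPU)
        cudaMemcpyAsync(A_local + off, A_remote + off,
                        rows * n * sizeof(float),
                        cudaMemcpyDeviceToDevice, streams[c]);
        matmul_chunk<<<(rows + 255) / 256, 256, 0, streams[c]>>>(
            A_local + off, B, C + off, rows, n);
    }
    for (int c = 0; c < CHUNKS; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
}
```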