> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.
So my naive assumption is: yes, it's obvious, but nontrivial.
The CUDA programming model relies on each kernel being computationally expensive enough to justify its launch overhead, and that isn't true for LLM token generation. We're also talking about evaluating the network more than 1,000 times per second, whereas previously, outside of recommendation systems, network evaluation topped out at roughly 100 per second.
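To put a rough number on that launch cost, here's a minimal sketch (mine, not from the thread) that times back-to-back launches of a trivial kernel; the kernel body, sizes, and iteration count are arbitrary choices for illustration:

```cuda
// Sketch: measure per-launch overhead of a tiny kernel, i.e. the cost a
// megakernel amortizes. Sizes and iteration counts are arbitrary.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void tiny_kernel(float* x) {
    // Trivial per-element work: the kind of small per-token op whose launch
    // overhead dominates during decode.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 4096, iters = 10000;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Warm up so one-time initialization doesn't skew the measurement.
    tiny_kernel<<<n / 256, 256>>>(d_x);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        tiny_kernel<<<n / 256, 256>>>(d_x);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg per-launch time: %.2f us\n", us);

    cudaFree(d_x);
    return 0;
}
```

On recent GPUs this usually lands somewhere in the low single-digit microseconds per launch; multiply that by the hundreds of kernel launches a decode step can issue, and at 1,000+ steps per second the overhead is no longer negligible.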
Also, nobody remembers Alex Krizhevsky's "One Weird Trick" paper, which slices the matmul into pieces to overlap device-to-device transfer with computation. That was 10 years ago.
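For anyone who hasn't seen it, the overlap idea looks roughly like the sketch below: split the work into chunks and use two streams so the transfer of one chunk overlaps the compute on the previous one. This is my own illustration of the generic copy/compute overlap pattern, using host-to-device copies and a dummy per-chunk kernel, not Krizhevsky's actual multi-GPU matmul code.

```cuda
// Sketch: chunked transfer/compute overlap with two CUDA streams.
// The kernel is a stand-in; real code would call cuBLAS per chunk.
#include <cuda_runtime.h>

__global__ void dummy_matmul_chunk(const float* a, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] * 3.0f;
}

int main() {
    const int chunk = 1 << 20, num_chunks = 8;
    float *h_a, *d_a, *d_c;
    cudaMallocHost(&h_a, chunk * num_chunks * sizeof(float));  // pinned, required for async copies
    cudaMalloc(&d_a, chunk * num_chunks * sizeof(float));
    cudaMalloc(&d_c, chunk * num_chunks * sizeof(float));

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);
    cudaEvent_t chunk_ready[num_chunks];

    for (int i = 0; i < num_chunks; ++i) {
        cudaEventCreate(&chunk_ready[i]);
        // Transfer chunk i on the copy stream...
        cudaMemcpyAsync(d_a + i * chunk, h_a + i * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(chunk_ready[i], copy_stream);
        // ...and start computing on it as soon as it lands, while the next
        // chunk is still in flight on the copy stream.
        cudaStreamWaitEvent(compute_stream, chunk_ready[i], 0);
        dummy_matmul_chunk<<<(chunk + 255) / 256, 256, 0, compute_stream>>>(
            d_a + i * chunk, d_c + i * chunk, chunk);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < num_chunks; ++i) cudaEventDestroy(chunk_ready[i]);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(d_a); cudaFree(d_c); cudaFreeHost(h_a);
    return 0;
}
```

The same structure applies when the transfer is an inter-GPU copy rather than host-to-device; the point is that the slice currently computing hides the latency of the slice currently in flight.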
There has been some niche research on parallel token generation lately, though...
What you need to do first is get really well-optimized kernels (since fast kernels make the dispatch overhead a relatively larger share of each step; quick arithmetic below), and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get hold of right now.
That said, at some point it just depends on where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.
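To make the "relatively more expensive" point concrete, here's a back-of-envelope calculation with assumed numbers (roughly 5 µs per launch, 100 µs for a naive kernel, 15 µs for a tuned one; none of these are measurements from the thread):

```cuda
// Back-of-envelope: what fraction of each launch+run pair is pure dispatch
// overhead? All numbers below are assumptions for illustration.
#include <cstdio>

int main() {
    const double launch_us = 5.0;         // assumed per-launch overhead
    const double slow_kernel_us = 100.0;  // assumed naive kernel runtime
    const double fast_kernel_us = 15.0;   // assumed hand-tuned kernel runtime

    printf("dispatch share, slow kernels: %.0f%%\n",
           100.0 * launch_us / (launch_us + slow_kernel_us));  // ~5%
    printf("dispatch share, fast kernels: %.0f%%\n",
           100.0 * launch_us / (launch_us + fast_kernel_us));  // ~25%
    return 0;
}
```

The launch cost itself doesn't change, but once the kernel is fast it goes from noise to a sizeable share of the step, which is when fusing everything into one kernel starts to pay off.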
Not as low-hanging as you might imagine.
What? Why? This seems like an obvious optimization if it's possible.