catlifeonmars
From the article

> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

So my naive assumption is that yes, it is obvious, but nontrivial.


saagarjha
Your naive assumption is the right one. It's quite hard to do this. Even doing it automatically, as is done here, runs into problems figuring out data dependencies and synchronization across nontrivial computation.
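
To make the synchronization issue concrete, here is a minimal sketch (not the article's actual implementation; the kernel, its arguments, and the two "layers" are hypothetical) of why data dependencies get hard once everything is fused into one kernel: inside a single kernel launch there is no implicit barrier between thread blocks, so a dependency between fused stages has to be enforced explicitly, e.g. with a cooperative-groups grid sync.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical two-stage megakernel: "layer" 2 reads "layer" 1's output.
// With separate kernel launches, the launch boundary is the barrier.
// Fused into one kernel, the dependency must be synchronized by hand.
__global__ void megakernel(const float* x, float* tmp, float* out, int n) {
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // "Layer" 1: elementwise op writing tmp.
    for (int j = i; j < n; j += stride)
        tmp[j] = x[j] * 2.0f;

    // Without this grid-wide barrier, a block running "layer" 2 may read
    // tmp entries that another block has not yet written.
    grid.sync();

    // "Layer" 2: consumes layer 1's output.
    for (int j = i; j < n; j += stride)
        out[j] = tmp[j] + 1.0f;
}
```

Note that `grid.sync()` only works if the kernel is launched with `cudaLaunchCooperativeKernel` and all blocks are co-resident on the GPU; a compiler that fuses arbitrary graphs has to figure out where such barriers are needed (and whether cheaper per-dependency signaling suffices) automatically, which is exactly the hard part.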
