> As shown in Figure 1, our megakernel outperforms vLLM and SGLang baselines (which use CUDA graphs and torch compilation)
I’m surprised the reduction in overhead from graphs vs. streams alone was so small. I feel like I’ve observed larger gains, but maybe I’m conflating CPU overhead with launch latency.
They should mention whether they did the graph uploads up front and whether they needed to change parameters within the graph.
It depends. Graphs should beat streams for repeated launches. The overhead of graph creation and instantiation makes graphs worse than streams unless you are relaunching the graph many times.
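To make the amortization concrete, the pattern looks roughly like this (a minimal sketch against the CUDA runtime API; `tinyKernel` and the loop counts are just stand-ins):

```cuda
#include <cuda_runtime.h>

__global__ void tinyKernel(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of launches into a graph (paid once).
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 8; ++i)
        tinyKernel<<<1, 256, 0, stream>>>(d);
    cudaStreamEndCapture(stream, &graph);

    // Instantiation is the expensive step: the graph is validated and
    // turned into an executable form.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Optionally push the executable graph to the device ahead of time,
    // so the first launch doesn't eat the upload cost.
    cudaGraphUpload(exec, stream);

    // Replays are cheap; the capture + instantiation cost only pays off
    // if this loop runs many times.
    for (int iter = 0; iter < 10000; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The `cudaGraphUpload` call is the "upload up front" the earlier comment asks about, and `cudaGraphExecKernelNodeSetParams` / `cudaGraphExecUpdate` are how you'd change parameters in an instantiated graph without re-capturing.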
The SGLang and vLLM numbers are with CUDA graphs enabled.
Having said that, a 1B model is an extreme example, hence the 1.5x speedup. For regular models and batch sizes this would probably buy you a few percent.
Yep, I was looking for this comparison too. I loved their approach, though.
I find their use of an on-GPU interpreter to be both a bit of an odd choice and interesting at the same time. Usually, you would not want an on-GPU interpreter for anything involving high performance. However, it sounds to me like there is not much room for improvement left under Amdahl's law, since each instruction dispatches a highly parallel function that runs orders of magnitude longer than the interpreter spends making the call. That in itself is interesting, although I still wonder how much room for improvement there would be if they dropped the interpreter.
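Roughly the shape I'd imagine, as a toy sketch (the opcodes, `Instr` layout, and handlers here are made up for illustration, not their actual instruction set):

```cuda
#include <cuda_runtime.h>

// Hypothetical opcodes for a persistent on-GPU interpreter.
enum Op { OP_MATMUL = 0, OP_NORM = 1, OP_HALT = 2 };

struct Instr { int op; int arg; };

// Stand-ins for heavy, block-wide operations that dominate runtime.
__device__ void do_matmul(int arg) { /* a tile of a matmul, etc. */ }
__device__ void do_norm(int arg)   { /* a layernorm tile, etc. */ }

// One launch runs the whole program: each block walks the instruction
// stream, paying a few cycles of dispatch per instruction while each
// handler runs orders of magnitude longer. That ratio is why Amdahl's
// law leaves the interpreter loop itself with little worth optimizing.
__global__ void interpret(const Instr* program) {
    for (int pc = 0; ; ++pc) {
        Instr ins = program[pc];        // uniform across the block
        switch (ins.op) {
            case OP_MATMUL: do_matmul(ins.arg); break;
            case OP_NORM:   do_norm(ins.arg);   break;
            case OP_HALT:   return;
        }
        __syncthreads();  // keep the block in lockstep between instructions
    }
}

int main() {
    const Instr host_prog[] = {{OP_MATMUL, 0}, {OP_NORM, 0}, {OP_HALT, 0}};
    Instr* dev_prog;
    cudaMalloc(&dev_prog, sizeof(host_prog));
    cudaMemcpy(dev_prog, host_prog, sizeof(host_prog), cudaMemcpyHostToDevice);
    interpret<<<1, 128>>>(dev_prog);    // a single launch replaces many
    cudaDeviceSynchronize();
    cudaFree(dev_prog);
    return 0;
}
```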
As the interpreter is core to the approach, I'm not entirely sure what's left if you drop that.
Without numbers, I am left wondering whether they omitted CUDA graph benchmarks due to a lack of effort, or because they actually did the benchmarks and did not want to admit that their approach was not as much of a performance advance as they portray it to be.