> As shown in Figure 1, our megakernel outperforms vLLM and SGLang baselines (which use CUDA graphs and torch compilation)
I’m surprised the reduction in overhead from graphs vs. streams alone was so small. I feel like I’ve observed larger gains, but maybe I’m conflating CPU overhead with launch latency.
They should mention whether they did the graph uploads up front and whether they needed to change parameters within the graph.
It depends. Graphs should beat streams for repeated launches. The overhead of graph creation and instantiation makes graphs worse than streams unless you are relaunching the graph many times.
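To make the amortization concrete, the pattern looks roughly like this (a minimal sketch against the CUDA runtime API; `tinyKernel` and the loop counts are just stand-ins):

```cuda
#include <cuda_runtime.h>

__global__ void tinyKernel(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of launches into a graph (paid once).
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 8; ++i)
        tinyKernel<<<1, 256, 0, stream>>>(d);
    cudaStreamEndCapture(stream, &graph);

    // Instantiation is the expensive step: the graph is validated and
    // turned into an executable form.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Optionally push the executable graph to the device ahead of time,
    // so the first launch doesn't eat the upload cost.
    cudaGraphUpload(exec, stream);

    // Replays are cheap; the capture + instantiation cost only pays off
    // if this loop runs many times.
    for (int iter = 0; iter < 10000; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The `cudaGraphUpload` call is the "upload up front" the earlier comment asks about, and `cudaGraphExecKernelNodeSetParams` / `cudaGraphExecUpdate` are how you'd change parameters in an instantiated graph without re-capturing.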
The SGLang and vLLM numbers are with CUDA graphs enabled.
Having said that, a 1B model is an extreme example, hence the 1.5x speedup. For regular models and batch sizes this would probably buy you a few percent.
Yep, I was looking for this comparison too. I loved their approach, though.
I find their use of an on-GPU interpreter to be both a bit of an odd choice and interesting at the same time. Usually, you would not want an on-GPU interpreter for anything involving high performance. However, it sounds to me like there is not much room for improvement left under Amdahl's law, since each instruction dispatches a highly parallel function that runs orders of magnitude longer than the interpreter spends making the call. That in itself is interesting, although I still wonder how much room for improvement there would be if they dropped the interpreter.
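Roughly the shape I'd imagine, as a toy sketch (the opcodes, `Instr` layout, and handlers here are made up for illustration, not their actual instruction set):

```cuda
#include <cuda_runtime.h>

// Hypothetical opcodes for a persistent on-GPU interpreter.
enum Op { OP_MATMUL = 0, OP_NORM = 1, OP_HALT = 2 };

struct Instr { int op; int arg; };

// Stand-ins for heavy, block-wide operations that dominate runtime.
__device__ void do_matmul(int arg) { /* a tile of a matmul, etc. */ }
__device__ void do_norm(int arg)   { /* a layernorm tile, etc. */ }

// One launch runs the whole program: each block walks the instruction
// stream, paying a few cycles of dispatch per instruction while each
// handler runs orders of magnitude longer. That ratio is why Amdahl's
// law leaves the interpreter loop itself with little worth optimizing.
__global__ void interpret(const Instr* program) {
    for (int pc = 0; ; ++pc) {
        Instr ins = program[pc];        // uniform across the block
        switch (ins.op) {
            case OP_MATMUL: do_matmul(ins.arg); break;
            case OP_NORM:   do_norm(ins.arg);   break;
            case OP_HALT:   return;
        }
        __syncthreads();  // keep the block in lockstep between instructions
    }
}

int main() {
    const Instr host_prog[] = {{OP_MATMUL, 0}, {OP_NORM, 0}, {OP_HALT, 0}};
    Instr* dev_prog;
    cudaMalloc(&dev_prog, sizeof(host_prog));
    cudaMemcpy(dev_prog, host_prog, sizeof(host_prog), cudaMemcpyHostToDevice);
    interpret<<<1, 128>>>(dev_prog);    // a single launch replaces many
    cudaDeviceSynchronize();
    cudaFree(dev_prog);
    return 0;
}
```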
As the interpreter is core to the approach, I'm not entirely sure what's left if you drop that.
Without numbers, I am left wondering whether they omitted CUDA graph benchmarks due to a lack of effort, or because they actually did the benchmarks and did not want to admit that their approach was not as much of a performance advance as they portray it to be.