This is (and was) the dream of Cerebras, and I am very glad to see it embraced, even if only in small part, on a GPU. Wild to see how much performance is left on the table for these things; it's crazy to think how much can be done by a few bold individuals when it comes to pushing the SOTA (not just in kernels either -- in other areas as well!)
My experience has been that getting over the daunting factor (feeling afraid of a big wide world with a lot of noise and marketing) and simply committing to a problem, learning it, and slowly bootstrapping it over time tends to yield phenomenal results in the long run for most applications. And if not, there's often an adjacent or side field you can pivot to and still make immense progress.
The big players may have the advantage of scale, but there is so, so much that can be done still if you look around and keep a good feel for it. <3 :)
hardwaresofton
Meta note but this paper is wonderfully written and incredibly approachable — excellent work by the authors.
yababa_y
It's definitely a blog post or article and not a paper; it isn't structured as a paper and is missing a lot of the things you'd expect from one.
And it is so wonderful for it :)
hardwaresofton
You're right; I thought this was one of the better-presented papers that have been coming through recently. It's just a blog post, but wow, I wish all papers read like this.
clbrmbr
They really nailed a casual style that didn't take away from the depth. More publications using the word "brr" please.
falcor84
Agreed, even though there's a general upwards trend, we've apparently hit peak "brr" in 2023 (75 results) and are significantly under-brring in 2025, with only 17 brr-focused publications so far. This is a call to arms for everyone: please brr harder!
https://openalex.org/works?page=1&filter=title_and_abstract....
After presenting their numbers, they mention that CUDA graphs also do much of this, but then say that the launch time is higher for them. It would have been more interesting if they had included comparison numbers.
Without numbers, I am left wondering whether they omitted CUDA graph benchmarks due to a lack of effort, or because they actually did the benchmarks and did not want to admit that their approach was not as much of a performance advance as they portray it to be.
skavi
> As shown in Figure 1, our megakernel outperforms vLLM and SGLang baselines (which use CUDA graphs and torch compilation)
I’m surprised the reduction in overhead for graphs vs streams alone was so little. I feel I’ve observed larger gains, but maybe I’m conflating CPU overhead with launch latency.
They should mention whether they did the graph uploads up front and whether they needed to change parameters within the graph.
01100011
It depends. Graphs should beat streams for repeated launches. The overhead of graph creation and instantiation makes graphs worse than streams unless you are relaunching the graph many times.
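For illustration, this is roughly what that amortization looks like with PyTorch's CUDA graph API; the model, shapes, and iteration counts below are placeholders, and the point is just that capture/instantiation is paid once while each subsequent step only copies into static buffers and replays:

    import torch

    # Placeholder model with static shapes; real inference engines capture much bigger graphs.
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    static_input = torch.zeros(1, 4096, device="cuda")

    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Pay the capture + instantiation cost once...
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g), torch.no_grad():
        static_output = model(static_input)

    # ...then every launch is just "update the static buffers and replay".
    for _ in range(1000):
        static_input.copy_(torch.randn(1, 4096, device="cuda"))
        g.replay()

This is also why "changing parameters within the graph" is usually not a blocker: the buffers stay at fixed addresses and you copy new values into them before each replay.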
saagarjha
Graphs basically suck; they have high overhead for replays or even loop nodes. It should not take a microsecond for the GPU to queue up another kernel, but it does.
skavi
I think the last sentence of the comment you’re replying to implies an awareness of that fact.
mmoskal
The SGLang and vLLM numbers are with CUDA graphs enabled.
Having said that, a 1B model is an extreme example - hence the 1.5x speedup. For regular models and batch sizes, this would probably buy you a few percent.
boroboro4
Yep, was looking to see this comparison too. I loved their approach though.
ryao
I find their use of an on-GPU interpreter to be both a bit of an odd choice and interesting at the same time. Usually, you would not want to use an on-GPU interpreter for anything involving high performance. However, it sounds to me like there is not much room for improvement left under Amdahl's law, since each instruction should call a highly parallel function that runs orders of magnitude longer than the interpreter takes to dispatch it. That in itself is interesting, although I still wonder how much room for improvement there would be if they dropped the interpreter.
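A back-of-the-envelope version of that Amdahl's law point (the 2% figure below is an assumed interpreter/dispatch share, not a measured number):

    # Amdahl's law: if the interpreter accounts for a fraction f of the forward pass,
    # removing it entirely speeds things up by at most 1 / (1 - f).
    f = 0.02  # assumed fraction of time spent fetching/dispatching instructions
    print(f"upper bound from dropping the interpreter: {1 / (1 - f):.3f}x")
    # -> ~1.02x, i.e. very little headroom if the dispatched ops dominate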
saagarjha
As the interpreter is core to the approach, I'm not entirely sure what's left if you drop that.
ryao
Whatever they have their interpreter doing could be done via assembly code without a separate instruction stream that needs to be interpreted. It is like running qemu-user to execute a program by interpreting it versus having the CPU execute it directly, except on a GPU.
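A toy CPU-side sketch of that distinction (pure illustration with stand-in ops, nothing resembling the actual megakernel code):

    import time

    # Stand-ins for long-running GPU ops; the sleeps mimic work that dwarfs dispatch.
    def attention(x): time.sleep(0.001); return x + 1
    def mlp(x):       time.sleep(0.001); return x * 2

    ops = {"attention": attention, "mlp": mlp}

    # "Interpreted": a dispatch loop decodes an instruction stream and calls each op.
    def run_interpreted(instructions, x):
        for name in instructions:
            x = ops[name](x)  # tiny per-instruction dispatch cost
        return x

    # "Direct": the same sequence hardcoded, with no instruction stream to decode.
    def run_direct(x):
        return mlp(attention(x))

    assert run_interpreted(["attention", "mlp"], 1.0) == run_direct(1.0)
    # When each op runs orders of magnitude longer than the dict lookup that
    # dispatches it, the two versions are indistinguishable in wall-clock time;
    # qemu-style interpretation only hurts when the interpreted instructions
    # are themselves cheap.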
kcorbitt
It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.
And of course the effect on throughput at larger batch sizes, which they allude to at the end.
Overall a very interesting result!
ptrj_
This could also give a nice speedup for MoE models w/ total 7B-70B parameters but O(10x) fewer active params, e.g. https://huggingface.co/Qwen/Qwen3-30B-A3B, assuming the expert router can be effectively scheduled within the monolithic mega-kernel.
mmoskal
They are reducing forward pass time from, say, 1.5ms to 1ms. On a bigger model you would likely reduce it from 15ms to 14.2ms or something like that.
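Plugging those numbers in (the 15ms/14.2ms pair is a rough guess, not a measurement):

    # Roughly the same absolute overhead removed, very different relative win.
    print(1.5 / 1.0)    # 1B-class model:  1.5x faster
    print(15.0 / 14.2)  # larger model:   ~1.06x, i.e. "a few percent"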
saagarjha
The thing I find really disappointing about CUDA is that Nvidia could provide the synchronization primitives needed to do this easily, but they don't. Scheduling on their cores remains really dumb, even though I know there is a bunch of work being done behind the scenes to service whatever async warp-specialized matrix multiplication instruction they added in this generation. It's just that there's no way to access it directly and you have to use the little bespoke bits that get exposed in each generation :(
xixihaha
Very bold direction and I love it. It looks like a lot of CUDA engineering expertise went into this. I am wondering: why set batch size to 1? I hope to see a comparison against real production setups at larger batch sizes. Also wondering how to extend it to other models, like MoE with expert parallelism, since a single CUDA kernel isn't supported across GPUs?
saagarjha
Because people running it interactively use batch size 1.
rudedogg
On a similar note:
I wonder if we'll see OS-level services/daemons to try to lower the time to first token as these things get used more. The interface for application developers would be a simple system prompt.
In some ways the idea sounds nice, but there would be a lot of downsides:
- Memory eaten up by potentially unused models
- Less compute available to software running specialized models for specific tasks
zackify
This is specifically why I use LM Studio and paid for 128GB on my MacBook.
I keep Devstral (~15GB) in memory at all times since I have so much extra.
I can't wait for a few years from now, when I can have triple the memory bandwidth at this RAM size.
wbl
I have sad news for you about how interface bandwidths have scaled. Yes, banking can help, but only so much.
kgeist
I've found that Linux already does this out of the box: if you load a model once, subsequent loads are much faster because the OS caches the on-disk pages entirely in RAM (assuming you have enough RAM). If you switch to another model and/or run out of RAM, the OS will automatically evict some of those pages to make room. So all you need to do is read the most-used model(s) on startup to warm the disk cache, plus add a TTL to unload models so they stop actively occupying (V)RAM.
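A minimal sketch of the "read the most-used model(s) on startup" half, assuming the weights live as files under a hypothetical /models/most-used path (the TTL/unload half would live in whatever process serves the model):

    import glob

    # Stream the files through once so the kernel page cache holds them;
    # subsequent model loads then come from RAM instead of disk, until
    # memory pressure evicts those pages again.
    def warm_page_cache(pattern="/models/most-used/*.gguf", chunk=16 * 1024 * 1024):
        for path in glob.glob(pattern):
            with open(path, "rb") as f:
                while f.read(chunk):
                    pass

    if __name__ == "__main__":
        warm_page_cache()

Tools like vmtouch do the same thing and also report how much of a file is already resident.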
terhechte
Would this also be possible with other LLM engines / GPUs? E.g. Llama / Apple Silicon or Radeon?
saagarjha
Yeah, none of this is specific to CUDA (though the relative latencies might be different).
WhitneyLand
Why all the trouble to speed things up while at the same time using bfloat16?
motomoto5188
Wondering how much it would improve prefill?
Stem0037
I wonder how much of this overhead (like the 250µs for activations/consistency on B200) could be further chipped away with even finer-grained control or different sync primitives.