
Voloskaya
> they set their hiring bar for engineers too high

Not sure I agree. If you look at the headcount growth of companies like OpenAI, Anthropic, etc., it is already extremely fast; it's already pretty hard to keep everything working smoothly at that rate of employee growth, so going even faster seems very risky.

Ultimately I think it's mostly caused by the field still being so new. Everything still needs to be optimized, and there just aren't that many very good CUDA programmers to start with; then you need to find one who also has deep knowledge of ML and transformer architectures, which further shrinks the pool. And when you do find one of them, there are 50 different things they could be working on instead of what's in the article, all equally or more impactful. The architectures constantly evolving also makes it hard, and not a great ROI, to go extremely deep chasing single-digit-percent optimizations when new things keep coming out that can be made an order of magnitude faster.

A good example of this is flash attention: it is maybe the most significant/impactful optimization in ML of the last few years. Tl;dr: how do you fuse the entire attention pipeline into a single kernel to make it much faster and avoid materializing huge intermediate tensors. The bottleneck was obvious to anyone who profiled a Transformer-based model, but there was no obvious solution because of how softmax works. Yet the paper that ultimately unblocked this was published back in 2019 [1]; it still took 3 years for a team to connect the dots. Most people doing pure ML engineering didn't know about the paper and didn't have good enough CUDA knowledge / GPU architecture understanding, most people with good CUDA knowledge don't understand ML well enough, and even the authors of that 2019 paper said "[we] hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware" but didn't have the skills to test this, or to see how it could be part of a bigger breakthrough, because that requires understanding core concepts of how GPUs work and the compute/memory imbalance.
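To make the softmax part concrete, here is a rough sketch (plain Python/NumPy, function names mine, not from the paper) of the online-normalizer idea from [1]: instead of separate passes over the row for the max and for the sum of exponentials, you keep a running max and a running denominator and rescale the denominator whenever the max changes. That single-pass recurrence is what later lets FlashAttention consume attention scores tile by tile without materializing the full score matrix; in practice it lives inside a fused CUDA kernel over tiles, the loop below is only there to show the recurrence.

```python
import numpy as np

def softmax_naive(x):
    # Separate passes over x: one for the max, one for the exponentials/sum,
    # one to normalize. Each pass is another trip through memory.
    m = np.max(x)
    e = np.exp(x - m)
    return e / np.sum(e)

def softmax_online(x):
    # Online normalizer: track the running max m and running denominator d
    # in a single pass, rescaling d whenever a new max appears.
    m = -np.inf
    d = 0.0
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    return np.exp(x - m) / d

x = np.random.randn(1024)
print(np.allclose(softmax_naive(x), softmax_online(x)))  # True
```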

[1]: https://arxiv.org/pdf/1805.02867

