> mostly because it’s new tech
Do you then think it'll improve to reach the same stability as other kinds of infra, eventually, or are there more fundamental limits we might hit?
My intuition is that as the models do more with less and the hardware improves, we'll end up with more stability just because we'll be able to afford more redundancy.
Hardware dependencies: GPUs and TPUs and all that are not equal. You will have to have code and caches that only work with Google’s TPUs, and other codes and caches that only work with CUDA, etc.
Data workflow: you will have huge LLM models that need to be loaded at just the right time.
Oh wait, your model uses MoE? That means the 200GB model that’s split over 10 “experts” only needs maybe 20GB of that. So then it would be great if we could somehow pre-route a request to the right GPU that already has that specific model loaded.
But wait! This is a long conversation, and the cache was actually on a different server. Now we need to reload the cache on the new server that actually has this particular expert preloaded in its GPU.
etc.
it’s very different, mostly because it’s new tech and very expensive and cost optimizations are difficult but impactful.