We run the nodes "hot" and close to overload for peak throughput. That's why NVIDIA's XQA innovation was so interesting, because it allows for much higher throughput for a given latency budget: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source....
Serverless would make more sense if we had a significant underutilization problem.
Serverless would make more sense if we had a significant underutilization problem.