> ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
Right, but at that point you're describing an H100 plus an additional ASIC, plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.
Inference is certainly cheaper, but getting it to run quickly still requires raw horsepower (thus wattage, thus heat dissipation).
Regarding CPUs, there's a severe memory-bandwidth issue. I haven't kept track of the extreme high-end hardware, but it's difficult for them to compete with GPUs on raw throughput.
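To put a rough number on the bandwidth point: at batch size 1, each generated token has to stream essentially the full set of weights from memory, so tokens/second is bounded by bandwidth divided by model size. A minimal sketch, assuming ballpark bandwidth figures (~90 GB/s for a dual-channel DDR5 desktop, ~3.35 TB/s for an H100's HBM3):

```python
# Back-of-the-envelope: batch-1 token generation is typically memory-bandwidth
# bound, because producing each token means streaming all of the weights once.
# Bandwidth figures below are rough ballpark assumptions, not measurements.

def tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck."""
    return bandwidth_gb_s / weights_gb

weights_gb = 70 * 2  # 70B parameters at FP16 ~= 140 GB of weights

for name, bw_gb_s in [
    ("dual-channel DDR5 CPU (~90 GB/s)", 90),
    ("H100 HBM3 (~3350 GB/s)", 3350),
]:
    print(f"{name}: at most ~{tokens_per_sec(bw_gb_s, weights_gb):.1f} tokens/s")
```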
Training is what costs the most in terms of power/memory/energy: it often means months of running multiple (likely 4-8) A100/H100 GPUs over the training data.
Performing inference is cheaper because you can 1) keep the model loaded in VRAM, and 2) run it on very few GPUs. With 80GB of capacity per H100, you would need two to run a 70B model at FP16, or just one at FP8 (counting weights alone). 32B models and smaller fit on a single H100 even at FP16. So you only need one or two GPUs to handle a request.
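For reference, the weight-only arithmetic behind those GPU counts (a quick sketch; it ignores KV cache, activations, and framework overhead):

```python
# Weight-only VRAM estimate: parameters * bytes per parameter.
# Ignores KV cache, activations, and framework overhead.
import math

H100_GB = 80

for params_b, bytes_per_param, label in [
    (70, 2, "70B @ FP16"),
    (70, 1, "70B @ FP8"),
    (32, 2, "32B @ FP16"),
]:
    weights_gb = params_b * bytes_per_param  # billions of params * bytes ~= GB
    gpus = math.ceil(weights_gb / H100_GB)
    print(f"{label}: ~{weights_gb} GB of weights -> {gpus} x 80 GB H100")
```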
ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
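To see why the activation functions aren't the interesting target, compare the work per token in a single linear layer against the ReLU that follows it (a toy calculation; the layer widths are assumptions, roughly in the range of a 70B-class model):

```python
# Per token, a d_in x d_out linear layer costs ~2*d_in*d_out FLOPs for the
# matmul, while the following ReLU is only d_out element-wise max operations.
d_in, d_out = 8192, 28672  # assumed hidden/FFN widths, 70B-class ballpark

matmul_flops = 2 * d_in * d_out
relu_ops = d_out

print(f"matmul: {matmul_flops:,} FLOPs vs ReLU: {relu_ops:,} ops "
      f"(~{matmul_flops // relu_ops}x more work in the matmul)")
```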
I think the sweet spot will be when CPUs have support for high-throughput matrix operations, similar to the existing SIMD extensions. That way the system benefits from being able to use system memory [1] and doesn't have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.
[1] This would allow access to large amounts of memory without having to chain multiple GPUs, making it possible to run the larger models at higher precision and to process large amounts of training data more efficiently.
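As a rough illustration of the capacity argument (weight-only math again; the 512 GB figure is just an assumed large-memory workstation/server, and this says nothing about bandwidth or compute):

```python
# Weights that need several chained 80 GB GPUs can sit comfortably in a
# single large pool of system RAM. Capacity only; bandwidth/compute aside.
import math

H100_GB = 80
SYSTEM_RAM_GB = 512  # assumed large-memory workstation/server

for params_b, bytes_per_param, label in [(70, 2, "70B @ FP16"),
                                         (180, 2, "180B @ FP16")]:
    weights_gb = params_b * bytes_per_param
    gpus = math.ceil(weights_gb / H100_GB)
    fits = "fits" if weights_gb <= SYSTEM_RAM_GB else "does not fit"
    print(f"{label}: ~{weights_gb} GB -> {gpus} x H100, "
          f"{fits} in {SYSTEM_RAM_GB} GB of system RAM")
```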