For a company of Google's size, the development costs of a custom TPU are quickly recovered.
Comparing a Google TPU with an FPGA is like comparing an injection-moulded part with a 3D-printed part.
Unfortunately, the performance gap between FPGAs and ASICs has grown considerably in recent years, because FPGAs have remained stuck on relatively ancient CMOS manufacturing processes, which are much less efficient than the state-of-the-art processes.
While this is common folk wisdom, it really isn't true. A surprising number of products ship with FPGAs inside, including ones designed to be "cheap". A great example of this is that Blackmagic, a brand known for being a "cheap" option in cinema/video gear, bases everything on Xilinx/AMD FPGAs (for some "software heavy" products they use the Xilinx/AMD Zynq line, which combines hard ARM cores with an FPGA). Pretty much every single machine vision camera on the market uses an FPGA for image processing as well. These aren't "one in every pocket" level products, but they are widely produced.
> Unfortunately, the performance gap between FPGAs and ASICs has grown considerably in recent years, because FPGAs have remained stuck on relatively ancient CMOS manufacturing processes
This isn't true either. At the high end, FPGAs are made on whatever the best available process is. Particularly for the top-end models that combine programmable fabric with hard elements, it would be insane not to produce them on the best process available. The big hindrance with FPGAs is that, almost by definition, the cell structures needed to provide programmability are inherently more complex and less efficient than the dedicated circuits of an ASIC. That often means a big hit to maximum clock rate, with resulting consequences for any serial computation being performed.
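To make the serial-computation point concrete, here's a back-of-envelope sketch (the clock rates are illustrative assumptions, not specs for any particular part): a chain of operations where each step depends on the previous one can't be spread across parallel fabric, so its runtime scales directly with clock rate.

```python
# Rough illustration: a serial dependency chain runs at one step per cycle,
# so its wall-clock time is steps / clock_rate, no matter how much parallel
# fabric is available. Clock figures below are illustrative assumptions.
serial_steps = 1_000_000

fpga_clock_hz = 500e6   # typical fabric clock for a complex FPGA design (assumed)
asic_clock_hz = 2.0e9   # plausible clock for the same logic hardened in an ASIC (assumed)

fpga_time = serial_steps / fpga_clock_hz
asic_time = serial_steps / asic_clock_hz

print(f"FPGA: {fpga_time * 1e3:.2f} ms, ASIC: {asic_time * 1e3:.2f} ms, "
      f"slowdown: {fpga_time / asic_time:.1f}x")
```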
While it is true that both cheap and expensive FPGAs exist, an FPGA system meant to replace a TPU would not use a $0.50 or even a $100 FPGA; it would use a Versal or UltraScale+ FPGA that costs thousands, compared to the (rough guess) $100/die you might spend for the largest chip on the most advanced process. Furthermore, the overhead of an FPGA means each one may support only a few million logic gates (maybe 2-5x that if you use hardened blocks), compared to billions of transistors on the largest chips in the most advanced node, so the cost per chip you buy is much, much higher for far less logic.
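Here's a rough cost-per-logic-element sketch using the figures above (thousands of dollars for a big FPGA with a few million logic cells, ~$100/die for a large ASIC with billions of transistors); every number is an order-of-magnitude assumption, not a quote.

```python
# Back-of-envelope cost per unit of logic, using the rough figures from the
# comment above. These are order-of-magnitude assumptions, not real prices.
fpga_price_usd = 5_000          # big Versal / UltraScale+ class part (assumed)
fpga_logic_cells = 4_000_000    # "a few million" logic cells (assumed)
hardening_factor = 3            # credit for hard blocks, "maybe 2-5x" (assumed)

asic_die_cost_usd = 100         # rough guess per large die on a leading node
asic_transistors = 50e9         # "billions of transistors" (assumed)
transistors_per_gate = 4        # crude NAND2-equivalent conversion (assumed)

fpga_cost_per_gate = fpga_price_usd / (fpga_logic_cells * hardening_factor)
asic_cost_per_gate = asic_die_cost_usd / (asic_transistors / transistors_per_gate)

print(f"FPGA: ~${fpga_cost_per_gate:.2e} per gate-equivalent")
print(f"ASIC: ~${asic_cost_per_gate:.2e} per gate-equivalent")
print(f"ratio: ~{fpga_cost_per_gate / asic_cost_per_gate:,.0f}x")
```

Even if the assumed numbers are off by an order of magnitude in the FPGA's favor, the per-gate cost gap stays enormous, which is the point of the comparison.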
To the second point, afaik leading-edge Versal FPGAs are made on 7 nm, which is not ancient, though also not the cutting-edge node used for ASICs (N3).
While TSMC 7 nm is much better than what most FPGAs use, it is still ancient compared with what current CPUs and GPUs use.
Moreover, I have never seen such FPGAs sold for less than thousands of $, i.e. they are much more expensive than GPUs of similar throughput.
Perhaps they are heavily discounted for big companies, but those are also the companies which could afford to design an ASIC with better energy efficiency.
I always prefer to implement an FPGA solution over any alternative, but unfortunately, all too often the FPGAs with high enough performance also have prices high enough to make them unaffordable.
The FPGAs with the highest performance that are still reasonably priced are the AMD UltraScale+ families, made with a TSMC 16 nm process, which is still good in comparison with most FPGAs, but is nevertheless a decade-old manufacturing process.
FPGAs kind of sit in this very niche middle ground. Yes, you can optimize your logic so that the FPGA does exactly the thing your use case needs, so your hardware maps more precisely to your use case than a generic TPU or GPU would. But what you gain in logic efficiency, you'll lose several times over in raw throughput to a generic TPU or GPU, at least for AI stuff, which is almost all matrix math.
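As a rough illustration of that throughput gap for matrix math: peak MAC throughput is roughly number-of-multipliers × clock × 2 ops. The device figures below are ballpark assumptions for illustration, not datasheet numbers for any specific FPGA or TPU generation.

```python
# Crude peak-throughput comparison for dense matrix math.
# All device figures below are ballpark assumptions, not datasheet numbers.
def peak_tops(mac_units: int, clock_hz: float) -> float:
    """Peak tera-ops/s, counting a multiply-accumulate as 2 ops."""
    return mac_units * clock_hz * 2 / 1e12

# Large FPGA: on the order of 10k DSP (multiply-accumulate) slices running
# at a few hundred MHz (assumed).
fpga = peak_tops(mac_units=10_000, clock_hz=600e6)

# TPU-style ASIC: a 256x256 systolic MAC array at ~900 MHz (assumed).
tpu = peak_tops(mac_units=256 * 256, clock_hz=900e6)

print(f"FPGA peak: ~{fpga:.0f} TOPS, TPU-style ASIC peak: ~{tpu:.0f} TOPS, "
      f"ratio: ~{tpu / fpga:.0f}x")
```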
Plus, getting that efficiency isn't easy; FPGAs have a higher learning curve and a way slower dev cycle than writing TPU or GPU apps, and FPGA designs take much longer to compile and test than CUDA code, especially when they get dense and you have to start working around gate timing constraints and such. It's easy to get to a point where even a tiny change can exceed some timing constraint and you've got to rewrite a whole subsystem to get it to synthesize again.
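A toy model of that timing-closure problem (the delay values are invented for illustration; real numbers come from the synthesis/place-and-route tools): the fabric clock can only run as fast as the slowest register-to-register path allows, so adding one more logic level to the critical path can flip the slack negative and force a redesign, e.g. extra pipelining.

```python
# Toy model of FPGA timing closure. All delay values are invented for
# illustration; real numbers come from the synthesis/place-and-route tools.
clock_period_ns = 4.0                 # target: 250 MHz fabric clock (assumed)

def slack(path_delays_ns):
    """Worst-case slack: clock period minus the longest register-to-register path."""
    return clock_period_ns - max(path_delays_ns)

# Design meets timing: the critical path fits within one clock period.
before = [3.1, 3.6, 2.8]
print(f"before change: slack = {slack(before):+.1f} ns (meets timing)")

# A "tiny change" adds one more logic level (~0.7 ns, assumed) to the
# critical path, pushing it past the clock period.
after = [3.1, 3.6 + 0.7, 2.8]
print(f"after change:  slack = {slack(after):+.1f} ns (fails timing)")

# Typical fix: split the long path with a pipeline register, at the cost of
# an extra cycle of latency and a ripple of changes through the design.
pipelined = [3.1, 2.2, 2.1, 2.8]
print(f"pipelined:     slack = {slack(pipelined):+.1f} ns (meets timing again)")
```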