It's a shame your accusations have zero merit. In the future, please try not to embarrass yourself by attempting to get into technical discussion, and promptly backing out of it having not made a single technical argument in the process. Good luck on the stock market
homie i work on AI infra at one of these companies that you're so casually citing in all of your marketing content here. you're not simply wrong on the things you claim - you're not even wrong. you literally don't know what you're talking about because you're citing external facing docs/code/whatever.
> attempting to get into technical discussion, and promptly backing out of it having not made a single technical argument in the process
there's no technical discussion to be had with someone that cites other people's work as proof for their own claims.
I'm always just like... who are you people: financiers, or hackers? :-) I don't work for TT, but I am a founder in the vertical AI space. Firstly, every major player is making AI accelerators of their own now, and guess what, most state-of-the-art designs have very little in common with a GPGPU design of yester-year. We have thoroughly evaluated various options, including buying/renting NVIDIA hardware; unfortunately, it didn't make any sense—neither in terms of cost, nor capability. Buying (and waiting _months_ for) NVIDIA rack-fuls is the quickest way to bankrupt your business with CAPEX. Renting the same hardware is merely moving the disease to OPEX, and in post-ZIRP era this is equally devastating.
No matter how much HBM memory you get for whatever individual device, no matter the packaging—it's never going to be enough. The weights alone are quickly dwarfed by K/V cache pages anyway. This is doubly true, if you're executing highly-concurrent agents that share a lot of the context, or doing dataset-scale inference transformations. The only thing that matters, truly, is the ability to scale-out, meaning fabrics, RDMA over fabrics. Even the leading-edge GPU systems aren't really good at it, because none of the interconnect is actually programmable.
The current generation of TT cards (7nm) has four 800G NIC's per card, and the actual Blackhole chips[1] support up to 12x400G. You can approach TT, they will license you the IP, and you get to integrate it at whatever scale you please (good luck even getting in a room with Arm people!) and because TT's whole stack is open source, you get to "punch in" whatever topology you want[2]. In other words, at least with TT you would get a chance to scale-out without bankrupting your business.
The compute hierarchy is fresh and in line with the latest research, their toolchain is as as hackable as it gets, and stands multiple heads above anything that AMD or Intel had ever released. Most importantly, because TT is currently under-valued, it presents an outstanding opportunity for businesses like ours in navigating around the established cost-centers. For example, TT still offers "Galaxy" deployments which used to contain 32 previous-generation (Wormhole) devices in a 6U air-cooled chassis. It's not a stretch that a similar setup, composed of 32 liquid-cooled Blackholes (2 TB GDDR6, 100 Tbps interconnect) would fit in a 4U chassis. AFAIK, There's no GPU deployment in the world at that density. Similarly to TPU design, it's also infinitely scalable by means of 3+D twisted torus topologies.
What's currently missing in the TT ecosystem: (1) the "superchip" package including state of the art CPU cores, like TT-Ascalon, that they would also happily license to you, and perhaps more importantly, (2) compute-in-network capability, so that the stupidly-massive TT interconnect bandwidth could be exploited/informed by applications.
Firstly, the Grendel superchip is expected to hit the market by the end of next year.
Secondly, because the interconnect is not some proprietary bullshit from Mellanox, you get to introduce the programmable-logic NIC's into the topology, and maybe even avoid IP encapsulation altogether! There are many reasons to do so, and indeed, Versal FPGA's have lots to offer in terms of hard IP in addition to PL. K/V cache management with offloading to NVMe-oF clusters, prefix-matching, reshaping, quantization, compression, and all the other terribly-parallel tasks which are basically intractable for anything other than FPGA's.
Today, if we wanted to do a large-scale training run, we would simply go for the most cost-effective option available at scale, which is renting TPU v6 from Google. This is a temporary measure, if anything, because compute-in-network in AI deployments is still a novelty, and nobody can really do it at sufficiently-large scale yet. Thankfully, Xilinx is getting there[3]. AWS offers f1 instances, it does offer NVMe-accelerated ones, as well as AI acclerators, but there's a good reason they're unable to offer all three at the same time.
[1] https://riscv.epcc.ed.ac.uk/assets/files/hpcasia25/Tenstorre...
[2] https://github.com/tenstorrent/tt-metal/blob/main/tech_repor...
[3] https://www.amd.com/en/products/accelerators/alveo/v80.html