Noob here. Why is that number bad?
LLM token generation is mostly memory-bandwidth-bound: to produce each token, the hardware has to stream every active parameter through memory. So if your model has 8 billion parameters and each parameter is one byte, then at 256 GB/s you can't do better than 32 tokens per second. If you try to load a model that's 80 gigs, you only get 3.2 tokens per second, which is kinda bad for something that costs 3-4k.
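A quick back-of-the-envelope in Python, assuming decode is purely bandwidth-bound (real throughput will be a bit lower once you account for KV-cache reads and other overhead):

    # Upper bound on decode speed: each generated token streams
    # every parameter byte through memory once.
    def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    print(max_tokens_per_sec(256, 8))   # 8B params @ 1 byte each -> 32.0 tok/s
    print(max_tokens_per_sec(256, 80))  # 80 GB model -> 3.2 tok/s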
There are newer "mixture of experts" (MoE) models that might be, say, 120B parameters total but only activate around 5B parameters per token (the specific parameters are chosen by a much smaller routing network). That's the kind of model that should excel on this machine; see the sketch below. Unfortunately, those models also run really well under hybrid inference, where the GPU handles the small-but-computationally-complex fully connected layers while the CPU handles the large-but-computationally-easy expert layers.
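Same arithmetic with made-up-but-typical MoE numbers: only the active parameters have to be read per token, so the bandwidth ceiling is set by active bytes, not total bytes:

    # MoE: per token you only read the router plus the selected experts.
    total_gb  = 120  # whole model at 1 byte/param
    active_gb = 5    # parameters actually touched per token

    print(256 / active_gb)  # ~51 tok/s ceiling with MoE routing
    print(256 / total_gb)   # ~2.1 tok/s if every parameter were read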
This product doesn't really have a niche for inference. Training and prototyping are another story, but I'm a noob on those topics.
Running LLMs will be slow and training them is basically out of the question. You can get a Framework Desktop with similar memory bandwidth for less than a third of the price of this thing (though it isn't NVIDIA).
> Running LLMs will be slow and training them is basically out of the question
I think it's the reverse: the use case for these boxes is basically training and fine-tuning, not inference.
File this one in the blue folder like the DGX