
Now we can see why they avoided giving a straight answer.

File this one in the blue folder like the DGX


stogot
Noob here. Why is that number bad?
TomatoCo
LLM performance depends on doing a lot of math on a lot of different numbers. To generate each token, the hardware has to read every parameter out of memory once, so memory bandwidth caps tokens per second. For example, if your model has 8 billion parameters and each parameter is one byte, then at 256 GB/s you can't do better than 32 tokens per second. And if you try to load a model that's 80 gigs, you only get 3.2 tokens per second, which is kinda bad for something that costs $3-4k.
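A quick back-of-envelope sketch of that ceiling (just the division above written out; the 256 GB/s, 8 GB, and 80 GB figures are the ones in this comment, not specs I looked up):

    # Decode-phase ceiling for a dense model: every weight is read from
    # memory once per generated token, so tokens/sec <= bandwidth / model size.
    bandwidth_gb_per_s = 256           # memory bandwidth of this box
    dense_8b_model_gb = 8              # 8B params at 1 byte per param
    big_model_gb = 80                  # the 80 GB model from above
    print(bandwidth_gb_per_s / dense_8b_model_gb)   # 32.0 tokens/sec, at best
    print(bandwidth_gb_per_s / big_model_gb)        # 3.2 tokens/sec, at best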

There are newer models called "Mixture of Experts" that are, say, 120B parameters but only use about 5B parameters per token (the specific parameters are chosen by a much smaller routing model). That is the kind of model that excels on this machine. Unfortunately for this box, those models also work really well with hybrid inference on an ordinary desktop, because the GPU can handle the small-but-computationally-complex fully connected layers while the CPU handles the large-but-computationally-easy expert layers.
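Same sketch for the MoE case, assuming the ~5B active parameters are also one byte each (an assumption for illustration, not a measured figure):

    # MoE decode ceiling: only the ~5B active parameters are read per token,
    # so the same 256 GB/s goes a lot further than with a dense 120B model.
    bandwidth_gb_per_s = 256
    active_params_gb = 5               # ~5B active params at 1 byte per param
    print(bandwidth_gb_per_s / active_params_gb)    # ~51 tokens/sec, at best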

This product doesn't really have a niche for inference. Training and prototyping are another story, but I'm a noob on those topics.

abtinf
My Mac laptop has 400 GB/s bandwidth. LLMs are bandwidth-bound.
kennethallen
Running LLMs will be slow and training them is basically out of the question. You can get a Framework Desktop with similar bandwidth for less than a third of the price of this thing (though that isn't NVIDIA).
embedding-shape
> Running LLMs will be slow and training them is basically out of the question

I think it's the reverse: the use case for these boxes is basically training and fine-tuning, not inference.

kennethallen
The use case for these boxes is a local NVIDIA development platform before you do your actual training run on your A100 cluster.
NaomiLehman
Refurbished M1 MacBooks for $1,500 have more bandwidth with less latency.
