
I mean.... LLM-in-a-box would actually be pretty neat! I'm looking at some air-gapped work coming up and having something like that would be quite handy

fc417fc802
Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box though. More like LLM in an expensive space heater.
stirfish
I was thinking more like those Bitcoin-mining USB ASICs that used to be a thing, but instead of becoming e-waste, you can still use them to talk with ChatGPT 2 or whatever. I'm picturing an LLM appliance.
fc417fc802
There is no magic ASIC that can get around needing to do hundreds of watts worth of computations and having on the order of hundreds of gigabytes of very fast memory. Otherwise the major players would be doing that instead of (quite literally) investing in nuclear reactors to power their future data center expansions.
rhdunn
Google have their own ASIC in the form of the TPU. The other major players have leveraged NVIDIA and -- to a lesser extent -- AMD. This is partly because investing in TPUs/ASICs is complex (it needs specialist knowledge and fabrication capacity) and GPU performance is hard to compete with.

Training is the part that costs the most in terms of power and memory, often requiring months of running multiple (likely 4-8) A100/H100 GPUs over the training data.

Performing inference is cheaper as you can 1) keep the model loaded in VRAM, and 2) serve it from far fewer GPUs. With 80GB of capacity per card you would need two H100s to run a 70B model at FP16, or one at FP8. 32B models and smaller fit on a single H100. So you only need 1 or 2 GPUs to handle requests.
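
A rough back-of-the-envelope sketch of that sizing (Python), counting only the weights; KV cache and activations need extra headroom on top:

    import math

    # Weight memory for inference is roughly parameter count x bytes per
    # parameter: 1B params at 1 byte/param is about 1 GB.
    H100_GB = 80

    def weights_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * bytes_per_param

    for label, params, bpp in [("70B @ FP16", 70, 2), ("70B @ FP8", 70, 1),
                               ("32B @ FP16", 32, 2)]:
        gb = weights_gb(params, bpp)
        print(f"{label}: ~{gb:.0f} GB of weights -> {math.ceil(gb / H100_GB)}x H100")
    # 70B @ FP16: ~140 GB -> 2x H100; 70B @ FP8: ~70 GB -> 1x; 32B @ FP16: ~64 GB -> 1x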

ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.

I think the sweet spot will be when CPUs have support for high-throughput matrix operations similar to the existing SIMD operations. That way the system will benefit from being able to use system memory [1] and not have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.

[1] This will allow access to large amounts of memory without having to chain multiple GPUs, making it possible to run larger models at higher precision and to process the large amounts of training data more efficiently.

fc417fc802
> ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.

Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.

Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).

Regarding CPUs, there's a severe memory bandwidth issue. I haven't kept track of the extreme high-end hardware, but it's difficult to compete with GPUs on raw throughput.
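
To put rough numbers on the bandwidth point: single-stream decoding has to stream essentially all of the weights from memory for every generated token, so tokens/sec is bounded by memory bandwidth divided by model size. The bandwidth figures below are ballpark assumptions for illustration:

    # Upper bound on single-stream decode speed:
    #   tokens/sec <= memory_bandwidth / model_bytes
    # because each generated token reads (roughly) every weight once.
    def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_gb

    model_gb = 40  # e.g. a 70B model quantized to ~4.5 bits/param
    for name, bw in [("dual-channel DDR5 (~90 GB/s)", 90),
                     ("high-end unified memory (~400 GB/s)", 400),
                     ("H100 HBM3 (~3350 GB/s)", 3350)]:
        print(f"{name}: <= {max_tokens_per_sec(model_gb, bw):.0f} tokens/s")
    # ~2, ~10, and ~84 tokens/s respectively -- the gap is the bandwidth issue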

otabdeveloper4
24 gigabytes is more than enough to run a local LLM for a small household or business.

This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.

The hundreds of gigabytes thing exists because the big cloud LLM providers went down the increasing parameter count path. That way is a dead end and we've reached negative returns already.

Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.

rhdunn
It depends on 1) what model you are running; and 2) how many models you are running.

You can just about run a 32B model (at Q4/Q5 quantization) on 24GB. Running anything larger (such as the increasingly common 70B models, or bigger still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, including enough capacity for the context.
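
A quick check on those numbers (Python); the bits-per-parameter values are assumed ballpark figures for Q4/Q5-style quantization:

    # Quantized weight size ~= params x bits per param / 8.
    def quant_gb(params_billion: float, bits_per_param: float) -> float:
        return params_billion * bits_per_param / 8

    VRAM_GB = 24
    for label, params, bits in [("32B @ ~Q4", 32, 4.6), ("32B @ ~Q5", 32, 5.5),
                                ("70B @ ~Q4", 70, 4.6)]:
        gb = quant_gb(params, bits)
        fit = "fits" if gb < VRAM_GB else "overflows"
        print(f"{label}: ~{gb:.1f} GB of weights, {fit} a {VRAM_GB} GB card "
              f"(~{max(VRAM_GB - gb, 0):.1f} GB left for context)")
    # 32B lands around 18-22 GB and just fits; 70B (~40 GB) has to spill into system RAM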

If you have other models -- such as text-to-speech, speech recognition, etc. -- then those also take up VRAM, both for the model weights and during processing/generation. That affects the size of LLM you can run.

fc417fc802
Only if you'll settle for less than state of the art. The best models still tend to be some of the largest ones.

Anything that overflows VRAM is going to slow down the response time drastically.

"Space heater" is determined by computational horsepower rather than available RAM.

How big a context window do you want? Last I checked that was very expensive in terms of RAM, and having a large one was highly desirable.
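
For a sense of scale, the KV cache grows linearly with context length: roughly 2 (K and V) x layers x KV heads x head dim x bytes per element, per token. The model shape below is an assumed 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dim 128) at FP16:

    # KV cache bytes ~= 2 x layers x kv_heads x head_dim x bytes/elem x tokens
    def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
        return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
    # ~2.7 GB at 8k, ~10.7 GB at 32k, ~42.9 GB at 128k, per concurrent sequence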

otabdeveloper4
State of the art is achieved by finetuning. Increasing parameter counts is a dead end.

Large contexts are very important, but they are cheap in terms of RAM compared to the cost of increasing parameter count.

stirfish
That's a really good point. I wasn't thinking further than ollama on my MacBook, but I'm not deploying my laptop into production.
If you focus on just the matmuls - no CUDA, no architectures, no InfiniBand, everything on a chip - and put input tokens in input registers and get output tokens from output registers, from a model that's baked into gates, you should be able to save some power. Not sure if it's 10x or 2x or 100x, but certainly there are gains to be had.
