otabdeveloper4
24 gigabytes is more than enough to run a local LLM for a small household or business.

This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.

The hundreds-of-gigabytes requirement exists because the big cloud LLM providers went down the path of ever-increasing parameter counts. That path is a dead end, and we've already hit negative returns.

Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.


rhdunn
It depends on 1) what model you are running; and 2) how many models you are running.

You can just about run a 32B model (at Q4/Q5 quantization) on 24GB. Running anything larger (such as the increasingly common 70B models, or bigger still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, with enough capacity left over for the context.
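Rough numbers, since people ask how that math works out. A back-of-the-envelope sketch in Python (the bits-per-weight figures are approximations for Q4/Q5-style quants; real loaders add overhead for scales, embeddings and runtime buffers):

    # Approximate weight footprint of a quantized model.
    def weight_gib(params_billions, bits_per_weight):
        bytes_total = params_billions * 1e9 * bits_per_weight / 8
        return bytes_total / 2**30

    for params, bits, label in [(32, 4.5, "32B @ ~Q4"),
                                (32, 5.5, "32B @ ~Q5"),
                                (70, 4.5, "70B @ ~Q4")]:
        print(f"{label}: ~{weight_gib(params, bits):.1f} GiB")

    # 32B @ ~Q4: ~16.8 GiB -> fits in 24GB with room for context
    # 32B @ ~Q5: ~20.5 GiB -> tight once the KV cache grows
    # 70B @ ~Q4: ~36.7 GiB -> has to spill out of a 24GB card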

If you have other models -- such as text-to-speech, speech recognition, etc. -- then those take up VRAM too, both for their weights and during processing/generation. That affects the size of LLM you can run.

fc417fc802
Only if you'll settle for less than state of the art. The best models still tend to be some of the largest ones.

Anything that overflows VRAM is going to slow down the response time drastically.

"Space heater" is determined by computational horsepower rather than available RAM.

How big a context window do you want? Last I checked that was very expensive in terms of RAM and having a large one was highly desirable.
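The KV cache is what makes long context expensive. A sketch with made-up but plausible architecture numbers for a 32B-class model (64 layers, head dim 128, fp16 cache); grouped-query attention is what keeps it manageable:

    # KV cache = 2 (K and V) x layers x KV heads x head dim
    #            x context length x bytes per element
    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, elem_bytes=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * elem_bytes / 2**30

    print(kv_cache_gib(64, 8, 128, 32_768))   # GQA, 8 KV heads:   ~8 GiB at 32K
    print(kv_cache_gib(64, 64, 128, 32_768))  # full MHA, 64 heads: ~64 GiB at 32K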

otabdeveloper4 OP
State of the art is achieved by finetuning. Increasing parameter counts is a dead end.

Large contexts are very important, but in terms of RAM they are cheap compared to the cost of increasing parameter counts.
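Same back-of-the-envelope as above (hypothetical GQA model shape, fp16 cache), just to put the two costs side by side:

    GIB = 2**30
    weights_32b = 32e9 * 4.5 / 8 / GIB                 # ~16.8 GiB of weights
    weights_70b = 70e9 * 4.5 / 8 / GIB                 # ~36.7 GiB of weights
    kv_32k      = 2 * 64 * 8 * 128 * 32_768 * 2 / GIB  # ~8 GiB of cache at 32K
    print(weights_32b, weights_70b, kv_32k)

Going from 32B to 70B adds roughly 20 GiB of weights, while a whole 32K of context costs about 8 GiB of cache under those assumptions.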
