
It depends on 1) what model you are running; and 2) how many models you are running.

You can just about run a 32B (at Q4/Q5 quantization) on 24GB. Running anything larger (such as the increasingly common 70B models, or bigger still if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and RAM. -- But yes, anything 24B or lower you can run comfortably, with enough headroom left for the context.
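As a rough back-of-envelope sketch (assuming roughly 4.5 bits/weight for a Q4/Q5 quant and a couple of GB of overhead for the KV cache/context -- both are ballpark assumptions, and the real figures vary with quant format and context length):

```python
def est_vram_gb(params_billions, bits_per_weight=4.5, overhead_gb=2.0):
    # weights: billions of params * bits per weight / 8 bits per byte ~= GB
    # overhead_gb: rough allowance for KV cache / context, not a measured figure
    return params_billions * bits_per_weight / 8 + overhead_gb

print(est_vram_gb(32))  # ~20 GB -> just fits on a 24GB card
print(est_vram_gb(70))  # ~41 GB -> has to spill into system RAM
```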

If you have other models -- such as text-to-speech, speech recognition, etc. -- then those are going to take up VRAM, both for their weights and during processing/generation. That affects the size of LLM you can run alongside them.
