Yes, I mean a Mac Studio with MLX.
An M3 Ultra with 256GB of RAM is $5599. That should just about be enough to fit MiniMax M2 at 8bit for MLX: https://huggingface.co/mlx-community/MiniMax-M2-8bit
Or maybe run a smaller quantized one to leave more memory for other apps!
Here are performance numbers for the 4bit MLX one: https://x.com/ivanfioravanti/status/1983590151910781298 - 30+ tokens per second.
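If anyone wants to try it, here's roughly what that looks like as a minimal sketch, assuming a recent mlx-lm release that supports the MiniMax M2 architecture and its standard load/generate API (the repo name is the one linked above):

    # Sketch: run the 8-bit MLX quant of MiniMax M2 via mlx-lm (pip install mlx-lm).
    # First run downloads the weights from Hugging Face; at 8 bits per weight that's
    # roughly 230 GB if the ~230B total-parameter figure for M2 is right, which is
    # why 256 GB "just about" fits.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/MiniMax-M2-8bit")

    prompt = "Explain the tradeoff between 4-bit and 8-bit quantization."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True))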
It’s kinda misleading to omit the generally terrible prompt processing speed on Macs
30 tokens per second looks good until you have to wait minutes for the first token
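Quick math with made-up but plausible numbers (the real prefill figures are in the linked chart), just to show how prompt processing dominates on long prompts:

    # Illustrative time-to-first-token math; the prefill speed here is an assumed
    # round number, not a measurement.
    prompt_tokens = 50_000        # e.g. a big agent/coding context
    prefill_tok_per_s = 200       # assumed prompt-processing speed
    gen_tok_per_s = 30            # the quoted generation speed
    answer_tokens = 1_000

    ttft = prompt_tokens / prefill_tok_per_s
    total = ttft + answer_tokens / gen_tok_per_s
    print(f"time to first token: {ttft/60:.1f} min, total: {total/60:.1f} min")
    # -> time to first token: 4.2 min, total: 4.7 min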
The tweet I linked to includes that information in the chart.
Thanks for the info! Definitely much better than I expected.
Running in CPU RAM works fine. It’s not hard to build a machine with a terabyte of RAM.
Admittedly I've not tried running on system RAM often, but every time I have it's been abysmally slow (< 1 T/s) with something like KoboldCPP or Ollama. Is there any particular method required to run them faster, or is it just "get faster RAM"? I fully admit my DDR3 system has quite slow RAM...
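A back-of-the-envelope bound suggests it mostly is the RAM: each generated token has to read all the active weights from memory once, so generation speed is capped at roughly bandwidth divided by bytes read per token. The bandwidth figures below are assumed/typical rather than measured, and I'm taking MiniMax M2's reported ~10B active parameters (of ~230B total) at 4-bit:

    # Upper bound on generation speed: every active weight is read once per token,
    # so tok/s <= memory bandwidth / bytes of active weights. All figures assumed.
    bytes_per_token = 10e9 * 0.5      # ~10B active params at 4 bits/param = ~5 GB

    for name, gb_per_s in [
        ("DDR3-1600 dual channel", 25),
        ("DDR5-6400 dual channel", 100),
        ("M3 Ultra unified memory", 800),
    ]:
        print(f"{name:24s} ceiling ~{gb_per_s * 1e9 / bytes_per_token:.0f} tok/s")
    # -> roughly 5, 20, and 160 tok/s; real throughput lands well below these ceilings

And if the model doesn't actually fit in RAM and spills to swap, you fall far below even the DDR3 line, which might explain the < 1 T/s.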
Honestly curious where you got this number, unless you're talking about extremely small quants. Even just a Q4 GGUF is ~130GB. Am I missing out on a relatively cheap way to run models this large reasonably well?
I suppose you might be referring to a Mac Studio, but (while I don't have one to speak from first-hand experience) it seems like there's some argument to be made over whether they run models "well"?
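For what it's worth, the sizes in this thread are consistent with each other if MiniMax M2 really is ~230B total parameters (the quantized files store every parameter, MoE or not); this is a rough sketch, not measured file sizes:

    # Rough quant sizes, assuming ~230B total parameters for MiniMax M2
    # and typical bits-per-weight for each format.
    total_params = 230e9
    for label, bits_per_weight in [("Q4-ish", 4.5), ("8-bit", 8.0), ("bf16", 16.0)]:
        print(f"{label:7s} ~{total_params * bits_per_weight / 8 / 1e9:.0f} GB")
    # -> Q4-ish ~129 GB, 8-bit ~230 GB (the "just about fits in 256 GB" case), bf16 ~460 GB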