petu
- I think your idea of MoE is incorrect. Despite the name, the "experts" aren't specialized in any particular domain; the set of active experts changes on more or less every token -- so swapping them into VRAM is not viable, and they just get executed on the CPU (llama.cpp).
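To see why per-token routing defeats expert swapping, here's a minimal sketch of top-k gating (pure Python, illustrative names like `route` and `gate` are my own, not from any real MoE implementation). Each token's hidden vector is scored against a router matrix and the top-k experts win, so consecutive tokens routinely pick different subsets:

```python
import random

def route(token_vec, gate_weights, k=2):
    # Score each expert via a dot product with the token's hidden
    # vector, then keep the indices of the k highest-scoring experts.
    scores = [sum(w * x for w, x in zip(row, token_vec))
              for row in gate_weights]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

random.seed(0)
num_experts, dim = 8, 16
gate = [[random.gauss(0, 1) for _ in range(dim)]
        for _ in range(num_experts)]

# Two different tokens will generally activate different expert
# subsets, which is why you can't pin a few "hot" experts in VRAM.
tok_a = [random.gauss(0, 1) for _ in range(dim)]
tok_b = [random.gauss(0, 1) for _ in range(dim)]
print(route(tok_a, gate), route(tok_b, gate))
```

Since the active pair can change on every token, prefetching experts to the GPU would mean a PCIe transfer per token, so runtimes like llama.cpp simply run the expert weights from system RAM on the CPU instead.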
- Use a ~3-bit quantized model with llama.cpp; Unsloth makes good quants:
https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...
Note that llama.cpp doesn't try to be a production-grade engine; it's more focused on local usage.