
petu

  1. I think your idea of MoE is incorrect. Despite the name, the experts aren't "expert" at anything in particular, and the set of active experts changes more or less on every token, so swapping them into VRAM on demand is not viable; llama.cpp just executes them on the CPU instead (see the routing sketch after this list).
  2. You can run a ~3-bit quantized model with llama.cpp; Unsloth makes good quants:

    https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...

    Note that llama.cpp doesn't try to be a production-grade inference engine; it's more focused on local usage. A minimal loading sketch follows below.
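
To make the per-token routing concrete, here is a minimal sketch of top-k MoE gating in Python/PyTorch. The sizes and names are illustrative assumptions rather than any real model's; the point is only that the selected expert indices change from token to token.

    # Minimal top-k MoE routing sketch (illustrative sizes, not a real model).
    import torch

    num_experts, top_k, hidden = 8, 2, 16
    router = torch.nn.Linear(hidden, num_experts)  # per-token gating network

    def route(token_state: torch.Tensor) -> list[int]:
        """Return the indices of the top-k experts selected for one token."""
        logits = router(token_state)
        return torch.topk(logits, top_k).indices.tolist()

    # Different tokens typically activate different experts, so there is no
    # small, stable subset of experts you could pin in VRAM ahead of time.
    for step in range(3):
        print(f"token {step}: experts {route(torch.randn(hidden))}")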
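
And for point 2, a hedged sketch of loading such a quant locally through the llama-cpp-python bindings; the model path is a hypothetical placeholder for whatever GGUF file you download from Unsloth.

    # Sketch: run a ~3-bit GGUF quant locally via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model-Q3_K_M.gguf",  # hypothetical Unsloth-style quant file
        n_gpu_layers=-1,                   # offload as many layers as fit to the GPU
        n_ctx=4096,                        # context window
    )

    out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])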
