What 1T-parameter base model have you seen from any of those labs?

It's MoE; each expert tower can be branched from some smaller model.
That's not how MoE works. You need to train the expert FFNs directly, together with the gate; otherwise the FFN gate would have no clue which expert to activate.
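To make the gate point concrete, here is a rough sketch of top-k routing in plain Python. Everything in it is a toy assumption for illustration (the names `moe_forward`, `gate_w`, and `make_expert` are hypothetical, and the "experts" just scale their input instead of being real FFNs). The thing to notice is that the gate is only a learned projection from the token to one score per expert, so its scores mean something only relative to the experts it was trained alongside.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def make_expert(scale):
    # Hypothetical stand-in for an expert FFN: it only scales the input.
    return lambda x: [scale * xi for xi in x]

def moe_forward(x, gate_w, experts, top_k=2):
    # The gate produces one logit per expert for this token.
    logits = matvec(gate_w, x)
    # Keep the top_k experts with the highest logits.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the selected logits into mixing weights.
    weights = softmax([logits[i] for i in top])
    # Output is the weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights

# Toy setup: three experts, a 2-dimensional token, hand-picked gate weights.
experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
gate_w = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
out, top, weights = moe_forward([1.0, 0.0], gate_w, experts, top_k=2)
```

In this sketch `gate_w` is set by hand. In a real MoE it is learned, and if the experts were swapped for independently trained models, nothing would tie those scores to what each expert actually does until the gate is trained against them.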
