Preferences

I don't know what they did specifically with GPT-2 as far as tokenization goes. It is probably hand-massaged BPE.
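
For anyone who hasn't stared at it, the core of BPE is almost embarrassingly simple: count adjacent pairs, merge the most frequent one, repeat. A rough sketch of one merge round in plain Python, purely illustrative (GPT-2's real tokenizer layers byte-level pre-tokenization and a hand-tuned splitting regex on top of something like this):

    # One round of byte-pair-encoding training: count adjacent symbol pairs
    # across the corpus, then merge the most frequent pair everywhere.
    from collections import Counter

    def bpe_merge_step(words):
        # words: list of symbol sequences, e.g. [["l", "o", "w"], ["l", "o", "w", "e", "r"]]
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            return words, None
        best = max(pairs, key=pairs.get)
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])   # fuse the pair into a new symbol
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        return merged, best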

GPT-2 was a pretty small model, and it only had to do enough to be impressive at the time and show some magic.

It doesn't matter, of course, whether the token identifiers fall in any kind of contiguous range, since the model never sees them; they are replaced with the embeddings. That said, I have sometimes wondered whether it would have been better to force some natural ordering on these things.
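
The ID is really just a row index into the embedding table, so the numeric layout is invisible to the network; permute the IDs and the rows together and you get an equivalent model. A toy sketch of what happens to a token ID on the way in (PyTorch, with made-up GPT-2-ish sizes):

    # The model only ever consumes embedding rows, never the raw IDs.
    import torch

    vocab_size, d_model = 50257, 768                # GPT-2-ish numbers
    embed = torch.nn.Embedding(vocab_size, d_model) # one learned row per token ID

    token_ids = torch.tensor([464, 2746, 1239])     # arbitrary IDs from the tokenizer
    x = embed(token_ids)                            # shape (3, 768): this is what the
                                                    # transformer layers actually see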

The token layout for LLaMA and LLaMA 2 also leaves much to be desired, and even GPT-3 and GPT-4 are all over the shop. Gemini is pretty neat from what I have seen of it; as far as I know it mostly handles numbers digit by digit. The way spaces are dealt with is different too: you don't get a bunch of words with hard-coded spaces on the front like you do with LLaMA.
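
The leading-space thing is easy to see by just dumping a slice of the vocabulary. A quick sketch, assuming the transformers library and access to a LLaMA-family checkpoint (any SentencePiece-based tokenizer shows the same pattern):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # SentencePiece marks a leading space with the '▁' metasymbol, so a big chunk
    # of the vocabulary is effectively "word with a space glued onto the front".
    for token_id in range(4000, 4010):
        print(token_id, tok.convert_ids_to_tokens(token_id))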

It's interesting, because there seems to be a trend with these models toward almost byte-oriented approaches in certain areas of the dictionary.
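
The LLaMA tokenizers are one concrete example: the SentencePiece vocabulary reserves a small region of entries of the form <0x00> through <0xFF> as a raw-byte fallback, so anything the learned merges don't cover can still be represented. A sketch to find that corner of the dictionary (same assumptions as above about the transformers library and checkpoint access):

    import re
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Scan for byte-fallback entries by pattern rather than hard-coding IDs.
    byte_tokens = [(i, t) for t, i in tok.get_vocab().items()
                   if re.fullmatch(r"<0x[0-9A-Fa-f]{2}>", t)]
    print(len(byte_tokens))          # should be 256: one entry per raw byte value
    print(sorted(byte_tokens)[:4])   # the lowest-numbered byte-fallback tokens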

That would kind of make sense: as the models get larger, the tokens don't have to do as much.

I must say that in my own experiments on very small models, you do get an itch to use pretty fat tokens. It looks good when you demo it, that is for sure. But I think it is much harder for the model because, as you noted, it has to create a lot of rules and exceptions internally that simpler token schemes would avoid.

Interesting stuff.

