I don't know what they did specifically with GPT-2 as far as tokenization goes. It is probably hand-massaged BPE.
GPT-2 was a pretty small model, and it only had to do enough to be impressive at the time and show some magic.
It doesn't matter, of course, that the token identifiers fall in any kind of continuous range, since the model never sees them; they are replaced with the embeddings. That said, I have sometimes wondered if it would have been better to force some natural ordering on these things.
The token layout for LLaMA and LLaMA 2 also leaves much to be desired, and even GPT-3 and GPT-4 are all over the shop. Gemini is pretty neat from what I have seen of it; as far as I know it mostly deals with individual digits. The way spaces are handled is different too: you don't get a bunch of words with hard-coded spaces on the front like you do with LLaMA.
It's interesting because there seems to be a trend with these models to head toward almost byte oriented approaches in certain areas of the dictionary.
That would kind of make sense as the models get larger, the tokens don't have to do as much.
I must say that in my own experiments on very small models, you do get an itch to use pretty fat tokens. It looks good when you demo it, that is for sure. But I think it is much harder for the model because, as you noted, it has to create a lot of rules and exceptions internally that simpler token schemes would avoid.
Interesting stuff.
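If anyone wants to poke at the differences, it only takes a couple of lines with the tiktoken package (a rough sketch; the encoding names exist, but rather than asserting from memory what the splits look like, it just prints them):

    # Compare how two OpenAI vocabularies carve up digits and leading spaces.
    import tiktoken

    for name in ("gpt2", "cl100k_base"):   # GPT-2 vs the GPT-4-era encoding
        enc = tiktoken.get_encoding(name)
        ids = enc.encode("The total was 1234567 dollars")
        pieces = [enc.decode_single_token_bytes(t) for t in ids]
        print(name, pieces)   # each element is the raw bytes of one token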
thethirdone
It has seemed to me that the GPT would be considerably better at numbers if it just considered each digit as a token. Has anyone actually done an experiment to test this?
I wouldn't disbelieve that the grouped version is actually better with data, but it fights my intuition pretty hard. Grouping based on frequency obfuscates the regular nature of numbers.
og_kalu
>It has seemed to me that the GPT would be considerably better at numbers if it just considered each digit as a token. Has anyone actually done an experiment to test this?
Yeah
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs - https://arxiv.org/abs/2402.14903
xVal: A Continuous Number Encoding for Large Language Models - https://arxiv.org/abs/2310.02989
I believe there's another paper that demonstrates something similar for the likes of spelling, counting, etc., but I can't remember it.
thethirdone
Very interesting paper. It does make sense to me that R2L chunking would be better than L2R chunking. It doesn't actually study single-digit tokenization, though.
I am mostly interested in a direct comparison between an LLM with wide (grouped-digit) tokenization and one with single-digit tokenization. It would be nice to see a direct comparison between similarly trained models; otherwise it is very hard to get a definitive answer by comparing models with varying sizes, training time, and general strength.
> xVal: A Continuous Number Encoding for Large Language Models - https://arxiv.org/abs/2310.02989
I have seen this paper before, but hadn't paid attention to the p10 vs p100 analysis. It's not clear that the findings would be relevant to an LLM like GPT-4, though.
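(To make the chunking terms concrete: L2R and R2L just describe which end you start grouping digits from. A toy illustration, not the paper's code:)

    # Toy L2R vs R2L digit chunking into groups of 3.
    def chunk(digits: str, size: int = 3, right_to_left: bool = True) -> list[str]:
        if right_to_left:
            # group from the right, the way we write 1,234,567
            groups = [digits[max(i - size, 0):i] for i in range(len(digits), 0, -size)]
            return groups[::-1]
        # group from the left, the way a greedy left-to-right tokenizer would
        return [digits[i:i + size] for i in range(0, len(digits), size)]

    print(chunk("1234567", right_to_left=False))  # ['123', '456', '7']
    print(chunk("1234567", right_to_left=True))   # ['1', '234', '567']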
throwawaymaths
> Grouping based on frequency obfuscates the regular nature of numbers.
It's literally a lookup table to go from token to embedding. What would you expect to improve? At best, maybe cache coherency if groups of numbers are converted to embeddings sequentially... but embeddings are huge (e.g. ~8 KB each for LLaMA 2), so you're losing the cache jumping around between non-contiguous numbers anyway.
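For concreteness, the lookup really is just an indexing operation into a table (a minimal PyTorch sketch with assumed LLaMA-2-7B-ish sizes, not the actual model code):

    import torch

    vocab_size, d_model = 32_000, 4_096            # assumed LLaMA-2-7B-style dimensions
    table = torch.nn.Embedding(vocab_size, d_model)

    ids = torch.tensor([17, 29, 561])              # arbitrary token ids, purely for illustration
    vectors = table(ids)                           # shape (3, 4096): one row fetched per token

    # In fp16 each row is 4096 * 2 bytes = 8192 bytes, which is where the ~8 KB figure comes from.
    print(vectors.shape)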
thethirdone
The nature of numbers as `A*10^(n+1) + B*10^n` for digits `XXXABXXX` is a very important relationship for doing any arithmetic. When you tokenize strings of digits into multi-digit chunks, you lose the position information within the token and create more complicated relationships between tokens, because the total number of token pairs increases.
For example, in order for a super simple model to learn 3-digit multiplication, it would need to see at least one example of each number token just to get ANY information about what number it represents. With single digits, by contrast, you only need examples where each digit has appeared in each position. Obviously we would hope to have plenty of data, but I would expect better generalization from models that need to rely less on memorization.
On the other hand, I can see a few reasons why grouped digits could be better, but they are more complicated than the reason above, so by Occam's razor my intuition says single digits should be better.
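A back-of-the-envelope version of that counting argument (my own illustration, no numbers taken from the papers above):

    # If every 3-digit number is its own token, the model must see each of the
    # 900 operand tokens (100-999) at least once just to know what it denotes,
    # and there are 810,000 distinct operand pairs for 3-digit x 3-digit.
    grouped_tokens = len(range(100, 1000))      # 900 distinct number tokens
    grouped_pairs = grouped_tokens ** 2         # 810,000 distinct (a, b) token pairs

    # With single-digit tokens there are only 10 symbols, and 3-digit operands
    # are covered once each digit 0-9 has been seen in each of the 3 positions.
    digit_tokens = 10
    digit_position_slots = digit_tokens * 3     # 30 digit/position combinations

    print(grouped_tokens, grouped_pairs, digit_position_slots)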
msp26
>After spending a lot of time with language models, I have come to the conclusion that tokenization in general is insane and it is a miracle that language models learn anything at all.
Truer words have never been spoken
throwawaymaths
Isn't this just ~ a frequency chart of how often number characters appear?
I'd be more interested in embeddings