
ASalazarMX
For an LLM? No idea.

Human: Which is the easier of these formulas

1. x = SQRT(4)

2. x = SQRT(123567889.987654321)

Computer: They're both the same.


You can view the tokenization for yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

[496, 675, 15717] is how GPT-4 represents the tokens. To work out which letters a token stands for, the model has to learn the relationship between "str" and [496]. It can learn that mapping (since it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but it adds an extra step.
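If you want to poke at this outside the playground, here's a minimal sketch using OpenAI's tiktoken package with the cl100k_base encoding (the one GPT-4 uses); the exact IDs and splits you get depend on the encoding:

    # Minimal sketch: inspect how a GPT-4-style tokenizer splits text.
    # Assumes the tiktoken package is installed (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4

    for text in ["x = SQRT(4)", "x = SQRT(123567889.987654321)"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r} -> {len(ids)} tokens: {list(zip(pieces, ids))}")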

The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?

It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers, math does appear to be significantly worse when the model doesn't have access to individual digits (early Llama math results, for example). Once the digit tokenization was changed, math performance improved.
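As a rough illustration of what that difference looks like, here's a sketch comparing the chunked digit tokens a GPT-4-style encoding produces with a per-digit split; the per-digit list is just an illustration, not any particular model's tokenizer:

    # Sketch: chunked digit tokens vs. a per-digit split of the same number.
    import tiktoken

    number = "123567889.987654321"

    enc = tiktoken.get_encoding("cl100k_base")
    chunked = [enc.decode([i]) for i in enc.encode(number)]
    per_digit = list(number)  # hypothetical per-digit tokenization

    print("chunked:  ", chunked)    # typically short multi-digit chunks
    print("per digit:", per_digit)  # every digit (and the dot) on its own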

drdeca
Depending on the data types and what the hardware supports, the latter may be harder (in the sense of requiring more operations)? And for a general algorithm bigger numbers would take more steps.
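For instance, a hardware float sqrt takes roughly constant time regardless of the operand, while a general iterative method needs more steps when the starting guess is further from the answer. A minimal sketch with Newton's method (the starting guess and tolerance are arbitrary choices for illustration):

    # Sketch: count Newton's-method iterations for sqrt from the same
    # starting guess. Bigger inputs take more steps to converge; a
    # hardware sqrt on a fixed-width float would not.
    def newton_sqrt(x, guess=1.0, rel_tol=1e-12):
        steps = 0
        while abs(guess * guess - x) > rel_tol * x:
            guess = 0.5 * (guess + x / guess)
            steps += 1
        return guess, steps

    for x in (4.0, 123567889.987654321):
        root, steps = newton_sqrt(x)
        print(f"sqrt({x}) ~= {root:.6f} in {steps} steps")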
