It isn't at all obvious to me that the LLM can decide to blur its vision, so to speak, and see the tokens as tokens: it doesn't get to run a program on this data in some raw format. Even if it does attempt to write a program and run it in a sandbox, it would have to "remember" what it was given and regenerate it rather than copy it (I suppose a tool could give it access to the history of its input, but at that point that tool likely sees characters). I am 100% with andy99 on this: it isn't anywhere near as simple as you are making it out to be.
If each character were represented by its own token, there would be no need to "blur" anything, since the model would receive a 1:1 mapping between input vectors and individual characters. I never claimed that character-level reasoning is easy or simple for the model; I only said that it becomes theoretically possible to generalize ("potentially learn") without memorizing the character makeup of every token, which is required when using subword tokenization.
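To make that concrete, here's a toy sketch in Python. The vocabularies are made up for illustration (this isn't any real tokenizer's output); the point is only what information the input ids themselves carry in each setup.

    # Hypothetical vocabularies, purely for illustration.
    subword_vocab = {"straw": 1001, "berry": 1002}   # BPE-style merges: ids are opaque
    char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

    word = "strawberry"

    # Subword view: two ids, neither of which exposes its own spelling,
    # so "how many r's?" can't be read off the ids without memorized knowledge.
    subword_ids = [subword_vocab["straw"], subword_vocab["berry"]]
    print(subword_ids)                                # [1001, 1002]

    # Character-level view: one id per character, so counting reduces to
    # a position-wise comparison the model could in principle generalize.
    char_ids = [char_vocab[c] for c in word]
    print(sum(1 for i in char_ids if i == char_vocab["r"]))   # 3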
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize and what has to be memorized.
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
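Put differently, the subword setup amounts to a memorized per-token lookup, while the character-level setup admits one rule that applies to any sequence, including words never seen in training. Another toy sketch, again with made-up vocabularies:

    # Subword case: effectively a memorized table, one entry per vocabulary token.
    memorized_r_counts = {"straw": 1, "berry": 2}
    print(sum(memorized_r_counts[t] for t in ["straw", "berry"]))   # 3

    # Character-level case: a single general rule, valid for unseen words too.
    def count_char(ids, target_id):
        return sum(1 for i in ids if i == target_id)

    char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    novel = "brrrzle"                                 # a made-up word
    print(count_char([char_vocab[c] for c in novel], char_vocab["r"]))   # 3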
I'm not sure what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.