In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
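To make that concrete, here's a toy sketch (not any real model's tokenizer): with a character-level vocabulary, "count the r's" reduces to "count occurrences of one token id", a position-independent rule that transfers to words never seen in training.

```python
def char_tokenize(word, vocab):
    """Map each character to its own token id, growing the vocab as needed."""
    for ch in word:
        vocab.setdefault(ch, len(vocab))
    return [vocab[ch] for ch in word]

vocab = {}
tokens = char_tokenize("strawberry", vocab)
# Counting a letter is literally counting a token id -- no memorized
# per-word facts required, so the same rule works on unseen words.
print(tokens.count(vocab["r"]))  # → 3
```

The same `tokens.count(...)` call works unchanged for any input word, which is exactly the generalization a subword vocabulary makes harder.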
I'm not sure what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
Count the number of Rs in this sequence: [496, 675, 15717]
Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
Human: Which is the easier of these formulas?
1. x = SQRT(4)
2. x = SQRT(123567889.987654321)
Computer: They're both the same.
[496, 675, 15717] is the GPT-4 representation of the tokens. To determine which letters a token represents, the model needs to learn the relationship between "str" and [496]. It can learn that mapping (since it can spell the token out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but it adds an extra step.
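That extra step can be sketched like this, using the id-to-string mapping from the comment above as a stand-in lookup table (a real model would have to learn this mapping implicitly rather than consult a dict):

```python
# Stand-in for the learned token-id -> surface-string relationship.
VOCAB = {496: "str", 675: "aw", 15717: "berry"}

def count_letter(token_ids, letter):
    # The "extra step": each token id must be resolved back to its
    # surface string before any character-level question can be posed.
    return sum(VOCAB[t].count(letter) for t in token_ids)

print(count_letter([496, 675, 15717], "r"))  # → 3
```

With character-level tokens the lookup disappears; with subword tokens, every character-level task routes through this indirection.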
The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?
It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers, math does appear to be significantly worse when the model doesn't have access to individual digits (the early Llama math results, for example). Once they changed the digit tokenization, math performance improved.
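A rough sketch of the two schemes (illustrative only, not the actual Llama tokenizer code): single-digit splitting guarantees the model always sees one atomic unit per digit, while greedy multi-digit chunking means "123" and "1234" can land in unrelated vocabulary entries.

```python
import re

def split_single_digits(text):
    # Every digit becomes its own token; non-digit runs stay whole.
    return re.findall(r"\d|[^\d]+", text)

def split_greedy_chunks(text, size=3):
    # Toy stand-in for a BPE that merges digits into opaque chunks.
    out = []
    for piece in re.findall(r"\d+|[^\d]+", text):
        if piece.isdigit():
            out.extend(piece[i:i + size] for i in range(0, len(piece), size))
        else:
            out.append(piece)
    return out

print(split_single_digits("12345 + 678"))  # ['1','2','3','4','5',' + ','6','7','8']
print(split_greedy_chunks("12345 + 678"))  # ['123','45',' + ','678']
```

In the first scheme, digit-wise algorithms (carrying, borrowing) line up with the token stream; in the second, the model has to memorize arithmetic facts about opaque chunks.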
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://www.hackerneue.com/item?id=39728870 )
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
https://twitter.com/yuntiandeng/status/1836114401213989366
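The right-to-left idea from the links above can be sketched as follows (assuming 3-digit groups; a toy illustration, not the papers' actual implementation). Left-to-right chunks shift every time a digit is added, while right-to-left chunks keep the low-order groups aligned with place value (ones, thousands, millions), which the linked results suggest helps arithmetic.

```python
def chunk_l2r(num, size=3):
    """Left-to-right grouping: chunk boundaries drift as numbers grow."""
    s = str(num)
    return [s[i:i + size] for i in range(0, len(s), size)]

def chunk_r2l(num, size=3):
    """Right-to-left grouping: low-order chunks stay aligned with place value."""
    s = str(num)
    head = len(s) % size
    return ([s[:head]] if head else []) + \
           [s[i:i + size] for i in range(head, len(s), size)]

print(chunk_l2r(1234567))  # ['123', '456', '7']
print(chunk_r2l(1234567))  # ['1', '234', '567']
```

Note that under R2L, "567" means the same thing (hundreds/tens/ones) in every number, whereas under L2R its meaning depends on total length.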
If anything I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of CoT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, so where are the papers solving the "count the Rs in strawberry" problem?
IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization would solve the problem. Meanwhile, RLHF'ers were (and are) being paid en masse to solve it by correcting thousands of these counting/perfect-syntax prompts. To me, the fact that RLHF work was being funded to tackle these problems means it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fix would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and other heavily syntax-based problems).
No, the reason models are better at these problems now is RLHF. And before you say "well, now models have learned how to count in general," I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but they can still be useful.
As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.
Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:
> Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.
> When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
It seems to be about as useful as asking a person how their hippocampus works: they might be able to make something up, or repeat a vaguely remembered bit of neuroscience, but they don't actually have access to their own hippocampus' internal workings, so if they're correct it's by accident.
Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, that then got picked up?