GPT-2 can successfully learn to do multiplication with the standard tokenizer, though, via "Implicit CoT with Stepwise Internalization".
https://twitter.com/yuntiandeng/status/1836114401213989366
If anything, I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring CoT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained the right way can do arithmetic, so where are the papers solving the "count the R's in strawberry" problem?
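For anyone who wants to see the tokenization side of this concretely, here's a minimal sketch using the tiktoken library's GPT-2 encoding (my choice of tool, not necessarily what any given chat model uses) to show that the model never receives "strawberry" as individual letters:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary

    word = "strawberry"
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    # The word arrives as a few multi-character chunks rather than 10
    # separate letters, so "how many r's?" has to be answered without
    # the r's ever appearing as distinct symbols in the input.
    print(pieces)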
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://www.hackerneue.com/item?id=39728870 )
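A quick way to poke at the integer-tokenization weirdness that post describes, again just a sketch with tiktoken's GPT-2 encoding:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for n in ["1000", "1001", "2347", "999999"]:
        pieces = [enc.decode([i]) for i in enc.encode(n)]
        # Chunk boundaries vary from number to number, so numerically
        # adjacent integers can get structurally different token
        # representations.
        print(n, pieces)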
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...