Didn't tokenization already have one bitter lesson: that it's better to let simple statistics guide the splitting, rather than expert morphology models? Would this technically be a more bitter lesson?

empiko
Agreed completely. There is a ton of research into how to represent text, and these simple tokenizers consistently perform at SOTA levels. The bitter lesson is that you should not worry about it that much.
kingstnap
Simple statistics aren't some be-all and end-all. There was a huge improvement in Python coding performance from fixing how indentation is tokenized.

Specifically, they added dedicated tokens for runs of 4, 8, 12, 16 (or thereabouts) spaces.
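The idea can be sketched with a toy pre-tokenizer (all names here are hypothetical, not any real tokenizer's vocabulary): leading indentation is greedily merged into dedicated indent tokens instead of emitting one token per space, which is the kind of change the comment above describes.

```python
import re

# Hypothetical indent-token vocabulary: one token per run of 4/8/12/16 spaces.
INDENT_TOKENS = {4 * n: f"<indent_{4 * n}>" for n in (1, 2, 3, 4)}

def tokenize_line(line: str) -> list[str]:
    """Toy pre-tokenizer: merge leading spaces into indent tokens,
    then do a crude whitespace split on the rest of the line."""
    tokens = []
    rest = line
    match = re.match(r" +", line)
    if match:
        n = len(match.group())
        # Greedily emit the largest matching indent token first.
        while n >= 4:
            width = min(16, (n // 4) * 4)
            tokens.append(INDENT_TOKENS[width])
            n -= width
        tokens.extend(" " * n)  # leftover spaces, one token each
        rest = line[len(match.group()):]
    tokens.extend(rest.split())
    return tokens

# An 8-space indent becomes a single token instead of eight.
print(tokenize_line("        return x"))  # → ['<indent_8>', 'return', 'x']
```

A real BPE tokenizer learns merges from data rather than using a hand-written rule like this, but the effect is the same: deeply nested Python code costs far fewer tokens, so more of it fits in context.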