Didn't tokenization already have one bitter lesson: that it's better to let simple statistics guide the splitting, rather than expert morphology models? Would this technically be a more bitter lesson?

empiko
Agreed completely. There is a ton of research into how to represent text, and these simple tokenizers consistently perform at SOTA levels. The bitter lesson is that you should not worry about it that much.
kingstnap
Simple statistics aren't some be-all and end-all. There was a huge improvement in Python coding performance from fixing how indentation is tokenized.

Specifically, they added dedicated tokens for runs of 4, 8, 12, 16 (or thereabouts) spaces.
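The idea can be sketched with a toy pre-tokenizer (all names here are hypothetical, not any real tokenizer's vocabulary): leading indentation is greedily merged into dedicated indent tokens instead of emitting one token per space, which is the kind of change the comment above describes.

```python
import re

# Hypothetical indent-token vocabulary: one token per run of 4/8/12/16 spaces.
INDENT_TOKENS = {4 * n: f"<indent_{4 * n}>" for n in (1, 2, 3, 4)}

def tokenize_line(line: str) -> list[str]:
    """Toy pre-tokenizer: merge leading spaces into indent tokens,
    then do a crude whitespace split on the rest of the line."""
    tokens = []
    rest = line
    match = re.match(r" +", line)
    if match:
        n = len(match.group())
        # Greedily emit the largest matching indent token first.
        while n >= 4:
            width = min(16, (n // 4) * 4)
            tokens.append(INDENT_TOKENS[width])
            n -= width
        tokens.extend(" " * n)  # leftover spaces, one token each
        rest = line[len(match.group()):]
    tokens.extend(rest.split())
    return tokens

# An 8-space indent becomes a single token instead of eight.
print(tokenize_line("        return x"))  # → ['<indent_8>', 'return', 'x']
```

A real BPE tokenizer learns merges from data rather than using a hand-written rule like this, but the effect is the same: deeply nested Python code costs far fewer tokens, so more of it fits in context.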