Yes, character by character is the future. At least for low-density languages like English (around one bit of information per character), it can be hard for an LLM working at that level to get a statistical grip on what's going on. Current tokenization methods are quite naive, but progress is being made both on improving the semantic meaning of tokens and on improving the performance of character-based models.
The linked tweet promotes a paper where the model operates directly on bytes - I'm inclined to say that this, rather than characters, is the future.
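For anyone unsure what "operating directly on bytes" looks like on the input side, here is a minimal sketch in plain Python (not from the paper; the model itself is omitted) - the vocabulary is just the 256 possible byte values:

    # Byte-level "tokenization": every string becomes a sequence of ints in 0..255.
    text = "Hello, 世界"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)
    # ASCII characters map to one byte each; the two CJK characters map to
    # three bytes each, so this string becomes 7 + 6 = 13 input positions.

No learned vocabulary or merge rules are needed, but sequences get longer, especially for scripts outside ASCII.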
That might work fine for English, but I would be very surprised if it performed better for non-ASCII logograms, e.g. hanzi/kanji. Or would the model just learn those characters on its own?
Interestingly enough, for Japanese the average is more than one token per character. That's not quite the byte level, but it's clearly sub-character: each character spans several UTF-8 bytes, and tokens regularly split inside them. And yet, despite this, it works amazingly well.
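You can check the tokens-per-character ratio yourself; here is a rough sketch using the tiktoken library with the cl100k_base encoding as an example (other tokenizers will give different numbers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["The quick brown fox", "吾輩は猫である"]:
        ids = enc.encode(text)
        # English usually comes out well under one token per character;
        # Japanese text often averages more than one token per character,
        # because tokens can split inside a character's multi-byte UTF-8 encoding.
        print(f"{text!r}: {len(ids)} tokens / {len(text)} chars "
              f"= {len(ids) / len(text):.2f} tokens per char")

The >1 ratio comes from the BPE operating on UTF-8 bytes underneath: a rare kanji can be split into two or three byte fragments rather than getting a token of its own.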