Yes, character by character is the future. At least for low-information-density languages like English (around one bit per character), it can be hard for an LLM to pick up on what's going on statistically. Current tokenization methods are quite naive, but progress is being made both on giving tokens more semantic meaning and on improving the performance of character-based models.

The linked tweet promotes a paper in which the model operates directly on bytes; I'd say that is the future instead.
That might work fine for English, but I would be very surprised if it performed better for non-ASCII logograms, e.g. hanzi/kanji. Or would the model just learn those characters on its own?
Interestingly enough, for Japanese the average is more than one token per character. That's not quite the byte level, but it's clearly below the character level, with token boundaries falling inside UTF-8 sequences. And yet, despite this, it works amazingly well.
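
To make the bytes-vs-characters point above concrete, here is a minimal sketch in Python. It assumes the third-party tiktoken library and OpenAI's cl100k_base BPE vocabulary, neither of which is mentioned in the thread, and the sample strings are my own; it simply counts characters, UTF-8 bytes, and BPE tokens for an English and a Japanese string.

    # Count characters, UTF-8 bytes, and BPE tokens for two sample strings.
    # Requires the third-party tiktoken library: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI BPE vocabulary (assumed for illustration)

    samples = {
        "English": "The quick brown fox jumps over the lazy dog.",
        "Japanese": "吾輩は猫である。名前はまだ無い。",
    }

    for name, text in samples.items():
        chars = len(text)                         # Unicode characters
        utf8_bytes = len(text.encode("utf-8"))    # bytes after UTF-8 encoding
        tokens = len(enc.encode(text))            # BPE tokens
        print(f"{name}: {chars} chars, {utf8_bytes} UTF-8 bytes, {tokens} tokens "
              f"({tokens / chars:.2f} tokens/char)")

Each kanji is three bytes in UTF-8, so a measured ratio above one token per character would confirm that some characters are indeed being split across tokens.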

