Yes, character by character is the future. At least for low-density languages like English (around one bit of information per character), it can be hard for an LLM working at that level to get a statistical grip on what's going on. Current tokenization methods are quite naive, but progress is being made both on improving the semantic meaning of tokens and on improving the performance of character-based models.
The linked tweet promotes a paper where the model operates directly on bytes - I'm inclined to say that this, rather than characters, is the future.
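For anyone unsure what "operating directly on bytes" looks like on the input side, here is a minimal sketch in plain Python (not from the paper; the model itself is omitted) - the vocabulary is just the 256 possible byte values:

    # Byte-level "tokenization": every string becomes a sequence of ints in 0..255.
    text = "Hello, 世界"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)
    # ASCII characters map to one byte each; the two CJK characters map to
    # three bytes each, so this string becomes 7 + 6 = 13 input positions.

No learned vocabulary or merge rules are needed, but sequences get longer, especially for scripts outside ASCII.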
That might work fine for English, but I would be very surprised if it performed better for non-ASCII logograms, e.g. hanzi/kanji. Or would the model just learn those characters on its own?
Interestingly enough, for Japanese the average is more than one token per character. That's not quite the byte level, but it's clearly sub-character: each character spans several UTF-8 bytes, and tokens regularly split inside them. And yet, despite this, it works amazingly well.
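You can check the tokens-per-character ratio yourself; here is a rough sketch using the tiktoken library with the cl100k_base encoding as an example (other tokenizers will give different numbers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["The quick brown fox", "吾輩は猫である"]:
        ids = enc.encode(text)
        # English usually comes out well under one token per character;
        # Japanese text often averages more than one token per character,
        # because tokens can split inside a character's multi-byte UTF-8 encoding.
        print(f"{text!r}: {len(ids)} tokens / {len(text)} chars "
              f"= {len(ids) / len(text):.2f} tokens per char")

The >1 ratio comes from the BPE operating on UTF-8 bytes underneath: a rare kanji can be split into two or three byte fragments rather than getting a token of its own.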