pphysch:
That might work fine for English, but I would be very surprised if it performed better for non-ASCII logograms, e.g. hanzi/kanji. Or would the model just learn those characters on its own?
Interestingly enough, for Japanese the average is more than one token per character. That's not all the way down to the byte level, but it is clearly below the character level, with tokens splitting inside a character's UTF-8 byte sequence. And yet, considering this, it works amazingly well.
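A quick sketch of why sub-character tokens arise at all: byte-level BPE tokenizers (the kind used by many current LLMs) operate on UTF-8 bytes, and every hanzi/kanji occupies 3 bytes in UTF-8. If the tokenizer's merge table doesn't fuse all three bytes of a rarer character, that single character ends up as two or three tokens. This snippet only shows the byte counts, not any particular model's tokenizer:

```python
# Each CJK character below encodes to 3 UTF-8 bytes, which is the
# raw material a byte-level BPE tokenizer works with. Unmerged
# byte sequences become multiple tokens per character.
for ch in "日本語":
    raw = ch.encode("utf-8")
    print(ch, len(raw), [hex(b) for b in raw])
```

By contrast, each ASCII letter is a single byte, and common English words are typically merged into whole-word tokens, which is why English averages well under one token per character while Japanese can average more than one.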