pphysch:
That might work fine for English, but I would be very surprised if it performed better for non-ASCII logograms, e.g. hanzi/kanji. Or would the model just learn those characters on its own?
Interestingly enough, for Japanese the average is more than one token per character. That's not all the way down to the byte level, but it is clearly below the character level, with tokens splitting inside a character's UTF-8 byte sequence. And yet, considering this, it works amazingly well.
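A quick sketch of why sub-character tokens arise at all: byte-level BPE tokenizers (the kind used by many current LLMs) operate on UTF-8 bytes, and every hanzi/kanji occupies 3 bytes in UTF-8. If the tokenizer's merge table doesn't fuse all three bytes of a rarer character, that single character ends up as two or three tokens. This snippet only shows the byte counts, not any particular model's tokenizer:

```python
# Each CJK character below encodes to 3 UTF-8 bytes, which is the
# raw material a byte-level BPE tokenizer works with. Unmerged
# byte sequences become multiple tokens per character.
for ch in "日本語":
    raw = ch.encode("utf-8")
    print(ch, len(raw), [hex(b) for b in raw])
```

By contrast, each ASCII letter is a single byte, and common English words are typically merged into whole-word tokens, which is why English averages well under one token per character while Japanese can average more than one.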