
Don't make me tap the sign: there is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people mean when they talk about modeling the "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of the Unicode code points.
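
To make the "biased" part concrete: UTF-8 spends one byte per character on ASCII but two to four on nearly everything else, so a byte-level model pays more tokens per character for non-Latin scripts. A quick illustration in plain Python (nothing model-specific, just the encoding itself):

    # Bytes per character under UTF-8: ASCII gets 1, most other scripts 2-4.
    for ch in ["A", "é", "中", "🙂"]:
        encoded = ch.encode("utf-8")
        print(ch, "->", len(encoded), "bytes:", list(encoded))
    # A -> 1 byte    é -> 2 bytes    中 -> 3 bytes    🙂 -> 4 bytes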

cschmidt
Virtually all current tokenization schemes do work at the raw byte level, not on Unicode code points. They do this to avoid the out-of-vocabulary (OOV) or unknown-token problem. In older models, if you came across something in the data you couldn't tokenize, you emitted an <UNK> token. But tokenization should be exactly reversible, so people now use subword tokenizers that include all 256 single bytes in the vocab. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add every Unicode code point to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). That does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
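
Roughly what the byte fallback looks like, as a toy sketch (the subword vocab here is made up, and no real library's API is implied):

    # Token ids 0-255 are reserved for the 256 raw byte values; subwords start at 256.
    subwords = {"hello": 256, " world": 257}   # hypothetical learned vocab

    def encode(text: str) -> list[int]:
        """Greedy longest-match over the subword vocab, with byte fallback."""
        ids, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in subwords:       # longest subword starting at i
                    ids.append(subwords[text[i:j]])
                    i = j
                    break
            else:
                # Nothing matched: drop down to the raw UTF-8 bytes of one character.
                ids.extend(text[i].encode("utf-8"))
                i += 1
        return ids

    def decode(ids: list[int]) -> str:
        inv = {v: k.encode("utf-8") for k, v in subwords.items()}
        data = b"".join(inv.get(t, bytes([t])) for t in ids)
        # errors="replace": a model emitting arbitrary byte tokens isn't
        # guaranteed to produce well-formed UTF-8.
        return data.decode("utf-8", errors="replace")

    print(encode("hello wörld"))   # "ö" falls back to two byte tokens (195, 182)
    print(decode([255, 254]))      # invalid UTF-8 decodes to replacement chars

Nothing is ever unrepresentable, which is the whole point, but the last line is why well-formed output isn't guaranteed.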
cschmidt
And in regard to UTF-8 being a shitty, biased tokenizer: here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
hiddencost
Well akshually...

I assume you started programming sometime this millennium? That's the only way I can explain this "take".

vaxman
Roger, who spoke only Chinglish and never paused between words, was working on a VAX FORTRAN program that exchanged tapes with IBM mainframes and a memory-mapped section, inventing a new word in the process that still has me rolling decades later: ebsah-dicky-asky-codah
roflcopter69
Care to elaborate?
