
(I'm not an expert on LLMs, this is just my guess)

No tokens is like no compression, so it's much less efficient. My previous sentence is 12 tokens (24 bytes), or 58 characters (58 bytes). Your self-attention block will be able to handle a much smaller effective context.
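If you want to check the comparison yourself, here is a rough sketch using the `tiktoken` package (assuming it is installed; the exact counts depend on which encoding you pick):

```python
# Rough sketch: compare token count vs. character count for a sentence.
# Assumes the `tiktoken` package is available (pip install tiktoken);
# exact numbers depend on the chosen encoding.
import tiktoken

sentence = "No tokens is like no compression, so it's much less efficient."
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

tokens = enc.encode(sentence)
print(f"{len(tokens)} tokens vs. {len(sentence)} characters")
print([enc.decode([t]) for t in tokens])  # see how the sentence was split
```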

Also, especially in English, tokens have quite a lot of semantic meaning by themselves, while characters are basically meaningless. So the model probably needs extra work to piece those characters together, versus operating on a token that already has meaning.


> tokens have quite a lot of semantic meaning by themselves, while characters are basically meaningless

Maybe one approach would be to tokenize not on tokens automatically found using BPE (Byte Pair Encoding), but on tokens that are semantically meaningful: word prefixes, suffixes and syllables. Of course we would have to decide whether to divide the word helicopter into heli-copter (pronunciation) or helico-pter (meaning), but both alternatives encode much more transferable meaning than GPT's he-lic-opter.

Of course the difficulty is that that's ... difficult. Your tokenizer would first need to understand what language a word is in, then how to split it in the context of that language. Maybe that's a task for a whole ML model, whereas current tokenizers can be simple largest-substring matchers.
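To make the largest-substring idea concrete, here is a toy sketch with a hand-invented "morpheme" vocabulary (the vocabulary and the resulting splits are made up for illustration; a real morpheme-aware tokenizer would need per-language segmentation rules):

```python
# Toy greedy longest-match tokenizer over a hand-picked "morpheme" vocabulary.
# The vocabulary is invented for illustration only.
MORPHEMES = {"heli", "copter", "pad", "ski", "ing", "port",
             "gyro", "quad", "parcel", "helico", "pter"}

def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary entry that prefixes the rest of the word."""
    pieces, i = [], 0
    while i < len(word):
        match = None
        for j in range(len(word), i, -1):   # try the longest prefix first
            if word[i:j] in vocab:
                match = word[i:j]
                break
        if match is None:                   # fall back to a single character
            match = word[i]
        pieces.append(match)
        i += len(match)
    return pieces

print(greedy_tokenize("helicopter", MORPHEMES))  # ['helico', 'pter'] - longest match wins
print(greedy_tokenize("heliskiing", MORPHEMES))  # ['heli', 'ski', 'ing']
```

Note that even the matching strategy bakes in a choice: greedy longest-match happens to pick helico-pter over heli-copter here, which is exactly the ambiguity mentioned above.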

Hmmm, is there any way for a model to predict / edit its own tokens as it learns?

Kind of a recursive, self-evolving tokenization process?

I’m pretty new to deep learning, so this analogy might be off. But I’m reminded of convolutional layers inside image-based models. Could we have “tokenization layers” in a GPT model?
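As a purely hypothetical sketch of that analogy (this is just an illustration in PyTorch, not an established architecture): a strided 1D convolution over character embeddings that pools runs of characters into token-like vectors, which could in principle be trained end to end with the rest of the model.

```python
# Hypothetical "tokenization layer": a strided 1D convolution over character
# embeddings that pools runs of characters into token-like vectors.
# Illustrative only; not an established architecture.
import torch
import torch.nn as nn

class CharPoolingLayer(nn.Module):
    def __init__(self, n_chars: int = 256, char_dim: int = 32,
                 token_dim: int = 128, window: int = 4, stride: int = 4):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)   # per-character embedding
        self.conv = nn.Conv1d(char_dim, token_dim,
                              kernel_size=window, stride=stride)  # learned pooling

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids)   # (batch, seq_len, char_dim)
        x = x.transpose(1, 2)      # (batch, char_dim, seq_len) for Conv1d
        x = self.conv(x)           # (batch, token_dim, roughly seq_len // stride)
        return x.transpose(1, 2)   # (batch, ~seq_len/stride, token_dim)

layer = CharPoolingLayer()
chars = torch.tensor([[ord(c) for c in "helicopter pad"]])  # (1, 14) character ids
print(layer(chars).shape)  # torch.Size([1, 3, 128])
```

Whether something like this could replace a fixed tokenizer inside a GPT-style model is exactly the open question here.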

---

Edit: I asked GPT-4 about my idea.

It said the comparison between tokenization and convolutional feature detection is a “somewhat accurate” analogy. And it agreed that making the encoder a fixed, separate process that doesn’t evolve during training does limit the GPT model in certain ways and introduces quirks.

But it said that this might increase the computational requirements significantly, that the transformer architecture doesn’t lend itself to having “tokenization layers”, and that it isn’t clear how one could do that.

It did say that there may be ways to work around the main challenges, and that there is some research in this direction.

> helico-pter (meaning)

That's not "meaning", that's etymology. If you divide it by meaning, you get heli-copter, which is evidenced by the way speakers use those morphemes: helipad, heliskiing, heliport, ... and parcelcopter, gyrocopter, quadcopter, ...

