Kind of a recursive, self-evolving tokenization process?
I’m pretty new to deep learning, so this analogy might be off. But I’m reminded of convolutional layers inside image-based models. Could we have “tokenization layers” in a GPT model?
---
Edit: I asked GPT-4 about my idea.
It said the comparison between tokenization and convolutional feature detection is a “somewhat accurate” analogy. And it agreed that making the encoder a fixed, separate process that doesn’t evolve during training does limit the GPT model in certain ways and introduces quirks.
But it said that this might increase the computational requirements significantly, that the transformer architecture doesn’t lend itself to having “tokenization layers”, and that it isn’t clear how one could do that.
It did say that there may be ways to work around the main challenges, and that there is some research in this direction.
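To make the idea a bit more concrete, here’s a rough sketch of what I have in mind (the module name, dimensions, and patch length are all invented for illustration; this is just one possible design, not an established one): a learned byte-level “tokenization layer”, i.e. a strided convolution that merges raw bytes into patch embeddings and is trained end to end with the rest of the model instead of being a fixed BPE step.

```python
import torch
import torch.nn as nn

class LearnedTokenizationLayer(nn.Module):
    """Hypothetical 'tokenization layer': a strided 1-D convolution over raw
    bytes that produces patch embeddings, learned jointly with the rest of
    the model instead of being a fixed BPE preprocessing step."""

    def __init__(self, d_model=512, patch_size=4):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)        # one vector per possible byte value
        self.merge = nn.Conv1d(d_model, d_model,
                               kernel_size=patch_size,
                               stride=patch_size)            # collapses patch_size bytes into one "token"

    def forward(self, byte_ids):                             # byte_ids: (batch, seq_len) ints in 0..255
        x = self.byte_embed(byte_ids)                        # (batch, seq_len, d_model)
        x = self.merge(x.transpose(1, 2)).transpose(1, 2)    # (batch, seq_len // patch_size, d_model)
        return x                                             # would feed into the usual transformer blocks

# Toy usage: turn the bytes of a string into learned "token" embeddings.
byte_ids = torch.tensor([list(b"heli-copter!")])             # (1, 12) bytes, a multiple of patch_size
print(LearnedTokenizationLayer()(byte_ids).shape)            # torch.Size([1, 3, 512])
```

A fixed stride is the simplest possible choice; a learned, content-dependent segmentation would be closer to the original idea, but that’s also where it stops being obvious how to fit it into the transformer stack.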
That's not "meaning", that's etymology. If you divide it by "meaning" you get heli-copter, which is evidenced by the way speakers use those morphemes: helipad, heliskiing, heliport, ... and parcelcopter, gyrocopter, quadcopter, ...
Maybe one approach would be to tokenize not on tokens automatically found using BPE (Byte Pair Encoding) but on tokens that are semantically meaningful: word prefixes, suffixes and syllables. Of course we will have to decide whether to divide the word helicopter into heli-copter (pronunciation) or helico-pter (meaning), but both alternatives encode much more transferable meaning than GPT's he-lic-opter.
Of course the difficulty is that that's ... difficult. Your tokenizer would first need to understand what language a word is in, then how to split it in the context of that language. Maybe a task for a whole ML model, where current tokenizers can be simple largest-substring matchers.
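To illustrate what a largest-substring matcher does with the three splits mentioned above, here's a toy greedy tokenizer. The vocabularies are made up purely for the example (real BPE applies learned merge rules rather than pure longest-match), but they show that it's the vocabulary, not the matching algorithm, that decides whether you get he-lic-opter, heli-copter, or helico-pter.

```python
def greedy_tokenize(word, vocab):
    """Largest-substring matching: at each position take the longest vocabulary
    entry that fits, falling back to a single character if nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:                                    # no entry matched: emit one character
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabularies, invented for the example.
bpe_like  = {"he", "lic", "opter"}    # mirrors the GPT-style he-lic-opter split
by_sound  = {"heli", "copter"}        # heli-copter
by_origin = {"helico", "pter"}        # helico-pter (Greek helix + pteron)

print(greedy_tokenize("helicopter", bpe_like))   # ['he', 'lic', 'opter']
print(greedy_tokenize("helicopter", by_sound))   # ['heli', 'copter']
print(greedy_tokenize("helicopter", by_origin))  # ['helico', 'pter']
```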