It's probably not necessary, though. Just cleaning the input data well enough, and then ensuring the vocabulary actually matches the training data, should be sufficient.
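As a rough sketch of that vocabulary-vs-training-data check (assuming the tiktoken library, that `cl100k_base` is the GPT-4 encoding, and a hypothetical `corpus.txt` sample of the training data), you can count how often each vocabulary token actually shows up:

```python
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Tokenize a sample of the training data and tally token frequencies.
counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        # disallowed_special=() so raw text containing special-token
        # strings doesn't raise an error.
        counts.update(enc.encode(line, disallowed_special=()))

# Vocabulary ids that never occur in the sample are the candidates
# for rare/undertrained tokens.
unseen = [t for t in range(enc.n_vocab) if t not in counts]
print(f"{len(unseen)} of {enc.n_vocab} token ids never appear in the sample")
```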
Edit: the token list for GPT-4 looks pretty clean. It's overwhelmingly dominated by code, but just from eyeballing fragments of the set I didn't see any tokens that were obviously going to be rare.
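If you want to do the same eyeballing yourself, here's a minimal sketch (again assuming tiktoken's `cl100k_base` is the GPT-4 vocabulary) that prints a random sample of the raw token bytes:

```python
import random

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Print the raw bytes of 50 randomly sampled tokens; many byte-level
# tokens aren't valid UTF-8 on their own, so we show bytes, not strings.
for t in random.sample(range(enc.n_vocab), 50):
    try:
        print(t, enc.decode_single_token_bytes(t))
    except KeyError:
        pass  # a few ids in the range are reserved/unassigned
```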