It's probably not necessary, though. Just cleaning the input data well enough, and then ensuring the vocabulary actually matches the training data, should be sufficient.
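As a rough sketch of that vocabulary-vs-training-data check (assuming the tiktoken library, that `cl100k_base` is the GPT-4 encoding, and a hypothetical `corpus.txt` sample of the training data), you can count how often each vocabulary token actually shows up:

```python
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Tokenize a sample of the training data and tally token frequencies.
counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        # disallowed_special=() so raw text containing special-token
        # strings doesn't raise an error.
        counts.update(enc.encode(line, disallowed_special=()))

# Vocabulary ids that never occur in the sample are the candidates
# for rare/undertrained tokens.
unseen = [t for t in range(enc.n_vocab) if t not in counts]
print(f"{len(unseen)} of {enc.n_vocab} token ids never appear in the sample")
```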
Edit: the token list for GPT-4 looks pretty clean. It's overwhelmingly dominated by code, but just from eyeballing fragments of the set I didn't see any tokens that were obviously going to be rare.
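If you want to do the same eyeballing yourself, here's a minimal sketch (again assuming tiktoken's `cl100k_base` is the GPT-4 vocabulary) that prints a random sample of the raw token bytes:

```python
import random

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Print the raw bytes of 50 randomly sampled tokens; many byte-level
# tokens aren't valid UTF-8 on their own, so we show bytes, not strings.
for t in random.sample(range(enc.n_vocab), 50):
    try:
        print(t, enc.decode_single_token_bytes(t))
    except KeyError:
        pass  # a few ids in the range are reserved/unassigned
```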