-Riley Goodside
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to "divide-by-zero" errors with these glitches. However, we've ruled that out.
The counting Reddit usernames are clearly a major source of glitch tokens. There's something about that sub that screws up the model, maybe the unusually predictable/repetitive nature of what goes on there, or the sheer obsessiveness of the posters.
The fact that the only non-counting Reddit usernames are both people at the center of massive Bitcoin psychodramas is suggestive, though, especially given the associations with the petertodd token. My guess is that these names appear much more frequently in other people's posts than is normal for forum usernames (especially if the training data included bitcointalk.org and Twitter too), and that maybe this has some effect.
To get an idea of just how frequently this token crops up in the Reddit corpus, do a search for it:
https://www.reddit.com/search/?q=%22petertodd%22&sort=commen...
I'm not sure how many posts match, but you can keep scrolling for a long time.
Having been more exposed than normal to what happened back then, I wasn't hugely surprised by which concepts clustered there, and sadly I don't think it's actually random or unexplainable. Flick through discussions from before 2015 (when the Bitcoin forums started to be heavily censored) and you'll see a lot of words very similar to what GPT spits out being used in connection with these names. It was all very nasty.
Ah, a Google search for 'limerick "finesse" "success"' had a few matching ones on the first page.
Tokens solve the following problem: the first layer of the neural net takes as input a big array of numbers whose size is a hyperparameter (a parameter that can be used to change the size or behavior of the network). A typical size for a big model like the GPTs is larger than 50,000. You have to somehow encode language as a sequence of assignments to this array of numbers. How do you do it?
The first and most obvious idea is characters. You could assign each Unicode code point to one of the slots in the input array and then use what's called a one-hot encoding, where every number is zero except the one for that character. You can do this, but it's not very efficient, because virtually all the training text is written in Latin-script languages and so almost all the slots will go unused.
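A toy sketch of that character/one-hot idea in Python (my own illustration, not anything from the post or from OpenAI's code):

    import numpy as np

    VOCAB_SIZE = 0x110000                  # one slot per Unicode code point

    def one_hot(char):
        vec = np.zeros(VOCAB_SIZE, dtype=np.float32)
        vec[ord(char)] = 1.0               # every slot zero except this character's
        return vec

    encoded = [one_hot(c) for c in "and"]  # text -> sequence of huge, mostly-empty vectors
    print(len(encoded), int(encoded[0].argmax()))   # 3 vectors; 'a' lights up slot 97

Almost every slot in every vector is wasted, which is exactly the inefficiency being described.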
A better way is to start with a big pile of text that you're going to use, and then iteratively assign the slots to sequences of characters based on how common they are. This makes it easier for the network to learn and reason. These sequences are tokens. For example "and" is very common, and so is "ing" (from the suffix), so those should get their own tokens. "SolidGoldMagikarp" isn't, so it really shouldn't; it should be represented as a sequence of tokens instead. There are algorithms that figure out the most efficient assignment of tokens to character sequences (which you can think of as the model's vocabulary) and then convert text into a sequence of these token numbers. OpenAI's implementation is called tiktoken, with its core written in Rust for speed.
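You can see this directly with tiktoken (a minimal sketch, assuming it's installed via pip; r50k_base is the older GPT-2/GPT-3 vocabulary where the glitch tokens were found):

    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")

    print(enc.encode(" and"))                # a very common string: a single token
    print(enc.encode("zxqv flurble"))        # a rare string: split into several pieces

    # The glitch strings are the exception: despite being rare in ordinary text,
    # " SolidGoldMagikarp" got a whole token of its own in this vocabulary.
    print(enc.encode(" SolidGoldMagikarp"))  # one token id

    # Decoding runs the mapping in reverse.
    print(enc.decode(enc.encode(" SolidGoldMagikarp")))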
The output of the network is likewise tokens, so you run the process in reverse at the end to get text back out of the array of floats the network produces. Or more accurately, the network produces a probability for each token in its vocabulary, and you can then pick the most probable (an oversimplification; in reality the way you select the token is more complicated than that, as otherwise you get bad results).
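A toy version of that last step (my own sketch with a made-up five-token vocabulary; real decoders add temperature, top-k/top-p, repetition penalties and so on):

    import numpy as np

    vocab = ["Hello", " world", " and", "!", " SolidGoldMagikarp"]
    logits = np.array([2.1, 3.4, 0.2, 1.0, -4.0])   # raw scores from the model

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax -> probabilities

    greedy = vocab[int(np.argmax(probs))]           # "pick the most probable"
    sampled = np.random.choice(vocab, p=probs)      # closer to what's actually done

    print(greedy, sampled, np.round(probs, 3))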
The problems here come from bugs in the training process, but they're no less interesting for that. Some character sequences that are very rare, and should really be represented by many different tokens in a row, somehow ended up being considered important enough to be given their own whole token. The most common cause seems to be that the token vocabulary was computed on text containing garbage or highly repetitive material, like debug logs from video games (hence the prevalence of obscure game characters like Leilan), and Reddit threads are clearly over-represented. But then GPT struggles to work out what these tokens actually mean because they hardly appear in the training set, so they seem to float together in embedding space, get easily conflated inside the model, and end up representing very vague or abstract concepts.
Do you have a source for that or was it just an assumption?
Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and concatenate the results) and get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into single threads, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from.
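A minimal sketch of that slicing idea, using a made-up Comment structure rather than any real Reddit API:

    from dataclasses import dataclass, field

    @dataclass
    class Comment:
        text: str
        children: list = field(default_factory=list)

    def slice_thread(root, pick=lambda children: children[0]):
        """Walk root -> child -> grandchild..., returning the texts in order."""
        conversation = [root.text]
        node = root
        while node.children:
            node = pick(node.children)     # e.g. the first or highest-scored reply
            conversation.append(node.text)
        return conversation

    thread = Comment("What's a glitch token?",
                     [Comment("A rare string that somehow got its own token.",
                              [Comment("Like ' SolidGoldMagikarp'?")])])
    print("\n\n".join(slice_thread(thread)))

Each slice is one coherent back-and-forth, which is exactly the property phpBB-style threads lack.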
Imageboards.
DailyMail.
Slashdot.
Even a somethingawful dump would have been superior.
The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.
listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime
It's very much "he who has the power to destroy a thing controls that thing"; so long as there's only one canonical copy of a post and it's on Reddit's servers, that's vulnerable to actions by the subreddit mods or Reddit themselves.