-Riley Goodside
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to "divide-by-zero" errors with these glitches. However, we've ruled that out.
The counting Reddit usernames are clearly a major source of glitch tokens. There's something about that sub that screws up the model, maybe the unusually predictable/repetitive nature of what goes on there, or the sheer obsessiveness of the posters.
The fact that the only non-counting Reddit usernames are both people at the center of massive Bitcoin psychodramas is suggestive, though, especially given the associations with the petertodd token. My guess is that these names appear much more frequently in other people's posts than is normal for forum usernames (especially if the training data included bitcointalk.org and Twitter too), and that maybe this has some effect.
To get an idea of just how frequently this token crops up in the Reddit corpus, do a search for it:
https://www.reddit.com/search/?q=%22petertodd%22&sort=commen...
I'm not sure how many posts match, but you can keep scrolling for a long time.
Having been more exposed than normal to what happened back then, I wasn't hugely surprised by which concepts clustered there, and sadly I don't think it's actually random or unexplainable. Flick through discussions from before 2015 (when the Bitcoin forums started to be heavily censored) and you'll see a lot of words very similar to what GPT spits out being used in connection with these names. It was all very nasty.
Ah, a Google search for 'limerick "finesse" "success"' had a few matching ones on the first page.
Tokens solve the following problem: the first layer of the neural net takes as input a big array of numbers whose size is a hyperparameter (a parameter that can be used to change the size or behavior of the network). A typical size for a big model like the GPTs is larger than 50,000. You have to somehow encode language as a sequence of assignments to this array of numbers. How do you do it?
The first and most obvious idea is characters. You could assign each Unicode code point to one of the slots in the input array and then use what's called a one-hot encoding, where every number is zero except the one for that character. You can do this, but it's not very efficient, because virtually all the training text is written in Latin-script languages and so almost all the slots will go unused.
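A toy sketch of that character/one-hot idea in Python (my own illustration, not anything from the post or from OpenAI's code):

    import numpy as np

    VOCAB_SIZE = 0x110000                  # one slot per Unicode code point

    def one_hot(char):
        vec = np.zeros(VOCAB_SIZE, dtype=np.float32)
        vec[ord(char)] = 1.0               # every slot zero except this character's
        return vec

    encoded = [one_hot(c) for c in "and"]  # text -> sequence of huge, mostly-empty vectors
    print(len(encoded), int(encoded[0].argmax()))   # 3 vectors; 'a' lights up slot 97

Almost every slot in every vector is wasted, which is exactly the inefficiency being described.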
A better way is to start with a big pile of text that you're going to use, and then iteratively assign the slots to sequences of characters based on how common they are. This makes it easier for the network to learn and reason. These sequences are tokens. For example "and" is very common, and so is "ing" (from the suffix), so those should get their own tokens. "SolidGoldMagikarp" isn't, so it really shouldn't; it should be represented as a sequence of tokens instead. There are algorithms that figure out the most efficient assignment of tokens to character sequences (which you can think of as the model's vocabulary) and then convert text into a sequence of these token numbers. OpenAI's implementation is called tiktoken, with its core written in Rust for speed.
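You can see this directly with tiktoken (a minimal sketch, assuming it's installed via pip; r50k_base is the older GPT-2/GPT-3 vocabulary where the glitch tokens were found):

    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")

    print(enc.encode(" and"))                # a very common string: a single token
    print(enc.encode("zxqv flurble"))        # a rare string: split into several pieces

    # The glitch strings are the exception: despite being rare in ordinary text,
    # " SolidGoldMagikarp" got a whole token of its own in this vocabulary.
    print(enc.encode(" SolidGoldMagikarp"))  # one token id

    # Decoding runs the mapping in reverse.
    print(enc.decode(enc.encode(" SolidGoldMagikarp")))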
The output of the network is likewise tokens, so you run the process in reverse at the end to get text back out of the array of floats the network produces. Or more accurately, the network produces a probability for each token in its vocabulary, and you can then pick the most probable (an oversimplification; in reality the way you select the token is more complicated than that, as otherwise you get bad results).
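A toy version of that last step (my own sketch with a made-up five-token vocabulary; real decoders add temperature, top-k/top-p, repetition penalties and so on):

    import numpy as np

    vocab = ["Hello", " world", " and", "!", " SolidGoldMagikarp"]
    logits = np.array([2.1, 3.4, 0.2, 1.0, -4.0])   # raw scores from the model

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax -> probabilities

    greedy = vocab[int(np.argmax(probs))]           # "pick the most probable"
    sampled = np.random.choice(vocab, p=probs)      # closer to what's actually done

    print(greedy, sampled, np.round(probs, 3))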
The problems here come from bugs in the training process, but they're no less interesting for that. Some character sequences that are very rare, and should really be represented by many different tokens in a row, somehow ended up being considered important enough to be given their own whole token. The most common cause seems to be that the token vocabulary was computed on text containing garbage or highly repetitive material, like debug logs from video games (hence the prevalence of obscure game characters like Leilan), and Reddit threads are clearly over-represented. But then GPT struggles to work out what these tokens actually mean because they hardly appear in the training set, so they seem to float together in embedding space, get easily conflated inside the model, and end up representing very vague or abstract concepts.
Do you have a source for that or was it just an assumption?
Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and concatenate the results) and get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into single threads, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from.
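A minimal sketch of that slicing idea, using a made-up Comment structure rather than any real Reddit API:

    from dataclasses import dataclass, field

    @dataclass
    class Comment:
        text: str
        children: list = field(default_factory=list)

    def slice_thread(root, pick=lambda children: children[0]):
        """Walk root -> child -> grandchild..., returning the texts in order."""
        conversation = [root.text]
        node = root
        while node.children:
            node = pick(node.children)     # e.g. the first or highest-scored reply
            conversation.append(node.text)
        return conversation

    thread = Comment("What's a glitch token?",
                     [Comment("A rare string that somehow got its own token.",
                              [Comment("Like ' SolidGoldMagikarp'?")])])
    print("\n\n".join(slice_thread(thread)))

Each slice is one coherent back-and-forth, which is exactly the property phpBB-style threads lack.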
Imageboards.
DailyMail.
Slashdot.
Even a somethingawful dump would have been superior.
The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.
listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime
It's very much "he who has the power to destroy a thing controls that thing"; so long as there's only one canonical copy of a post and it's on Reddit's servers, that's vulnerable to actions by the subreddit mods or Reddit themselves.