
It's a subreddit where people count. There are a lot of comments, and unfortunately the tokenizer was trained on this subreddit, which is why these weird tokens exist.

My hunch is the token “ davidjl” is repetitively used in the data that determines the tokenization scheme but is nearly (or completely) absent from the actual pre-train data.

-Riley Goodside
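
One half of that hunch is easy to check locally. Here's a minimal sketch using the tiktoken library, under the assumption that r50k_base (the ~50k BPE vocabulary shared by GPT-2/GPT-3) is the encoding these glitch tokens live in; it just prints how each string gets split:

    # pip install tiktoken
    import tiktoken

    # Assumption: the glitch tokens under discussion come from the ~50k BPE
    # vocabulary shared by GPT-2/GPT-3, which tiktoken calls "r50k_base".
    enc = tiktoken.get_encoding("r50k_base")

    for s in [" davidjl", " petertodd", " SolidGoldMagikarp", " ordinaryusername"]:
        ids = enc.encode(s)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{s!r:24} -> {len(ids)} token(s): {pieces}")

The point of the check is just to confirm that these strings really are single entries in the vocabulary, rather than ordinary text that the tokenizer breaks into several pieces.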

That hunch was investigated here:

https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

The idea that tokens found closest to the centroid are those that have moved the least from their initialisations during their training (because whatever it was that caused them to be tokens was curated out of their training corpus) was originally suggested to us by Stuart Armstrong. He suggested we might be seeing something analogous to "divide-by-zero" errors with these glitches. However, we've ruled that out.
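
To make the "distance from the centroid" idea concrete, here's a rough sketch (not the authors' code; GPT-2 is used purely because its weights are easy to load, and the LessWrong posts looked at other models too) that pulls the input embedding matrix via Hugging Face transformers, computes the mean embedding, and lists the tokens sitting closest to it:

    # pip install torch transformers
    import torch
    from transformers import GPT2TokenizerFast, GPT2LMHeadModel

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # (vocab_size, d_model) matrix of input token embeddings.
    emb = model.get_input_embeddings().weight.detach()
    centroid = emb.mean(dim=0)

    # Euclidean distance of every token's embedding to the centroid,
    # then the 25 tokens that sit closest to it.
    dists = torch.linalg.norm(emb - centroid, dim=1)
    for token_id in torch.argsort(dists)[:25].tolist():
        print(f"{dists[token_id]:.4f}  {tok.decode([token_id])!r}")

Which tokens turn up near the mean depends on the model; the sketch is only meant to show what "closest to the centroid" is measuring.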

The counting-subreddit usernames are clearly a major source of glitch tokens; there's something about that sub that screws up the model, maybe the unusually predictable/repetitive nature of what goes on there, or the sheer obsessiveness of the posters.

The fact that the only non-counting Reddit usernames both belong to people at the center of massive Bitcoin psychodramas is suggestive, however, especially given the associations with the petertodd token. My guess is that these names appear much more frequently in other people's posts than is normal for forum usernames (especially if the scrape included bitcointalk.org and Twitter too), and that this may have some effect.

To get an idea of how frequently this token crops up in the Reddit corpus, do a search for it:

https://www.reddit.com/search/?q=%22petertodd%22&sort=commen...

I'm not sure how many posts match, but you can keep scrolling for a long time.

Having been more exposed than most to what happened back then, I wasn't hugely surprised by which concepts clustered there, and sadly I don't think it's actually random or unexplainable. Flick through discussions from before 2015 (when the Bitcoin forums started to be heavily censored) and you'll see many of the same words GPT spits out being used in connection with that name. It was all very nasty.

Yeah, this is true of every glitch token: the distribution of the tokenizer's training data and the distribution of the GPT model's training data are very different.

> unfortunately the tokenizer was trained on this subreddit

Do you have a source for that or was it just an assumption?

It's pretty much guaranteed. Where else on the internet would this sequence of characters appear frequently enough to get selected as one of the ~50,000 "words" in the tokenizer's vocabulary?
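
You can watch that selection mechanism happen at toy scale: train a small BPE vocabulary on a corpus where one username is wildly over-represented, the way r/counting usernames would be, and it gets promoted to a single token while rare names stay in pieces. A sketch using the Hugging Face tokenizers library; the corpus, the names, and the vocab size are all made up for illustration:

    # pip install tokenizers
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Toy corpus: one "username" repeated obsessively, plus 2000 one-off names.
    corpus = ["davidjl123"] * 10000 + [f"user{i}" for i in range(2000)]

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(corpus, trainer)

    # The over-represented name wins one of the scarce vocabulary slots...
    print(tokenizer.encode("davidjl123").tokens)
    # ...while a name the trainer saw only once stays split into pieces.
    print(tokenizer.encode("user1537").tokens)

The real GPT-2/GPT-3 tokenizer was built the same way, just with a ~50k slot budget over a much bigger corpus; if that corpus leaned heavily on r/counting, usernames like these would clear the bar easily.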

Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then a child, then a child of that child, and so on, then concatenate the results) and get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into a single thread, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from them.
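
For what it's worth, here is a sketch of what that slicing looks like on a hypothetical comment tree stored as nested dicts; the field names ("body", "replies") are made up, and real Reddit dumps are shaped differently:

    import random

    # Hypothetical comment tree: each node has some text and a list of replies.
    thread = {
        "body": "What's a glitch token?",
        "replies": [
            {"body": "A token the model barely saw in training.",
             "replies": [{"body": "Like SolidGoldMagikarp?", "replies": []}]},
            {"body": "Off-topic rant.", "replies": []},
        ],
    }

    def slice_thread(node, rng=None):
        """Walk root -> child -> grandchild, picking one reply per level,
        and return the resulting linear conversation."""
        rng = rng or random.Random(0)
        turns = []
        while node is not None:
            turns.append(node["body"])
            node = rng.choice(node["replies"]) if node["replies"] else None
        return turns

    print("\n".join(slice_thread(thread)))

Each pass yields one coherent back-and-forth, and different random choices give different conversations from the same thread. The phpBB case is hard precisely because the reply structure isn't recorded, so there's no tree to walk.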

> There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a SomethingAwful dump would have been superior.

Slashdot doesn't have the volume. I don't know about imageboards, but are they threaded, and do they cover as many topics?

The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.

imageboards sure do -- poorly.

listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime

So filter the content before you use it. Clearly OpenAI did the bare minimum on this front.

There was a video[0] on Computerphile about this topic.

[0] https://www.youtube.com/watch?v=WO2X3oZEJOA

See the old SolidGoldMagikarp drama; it's happened before.

Is this why Reddit is charging for its API? To try to capture value from the entities training LLMs on Reddit data?

That ship has probably sailed, since Pushshift hosts a copy of all of Reddit, including removed content. It's really weird to me that they even allowed that level of access for that long (FWIW I don't agree with the recent changes, but I see little legitimate reason to give someone a way to scan and copy all recently made posts across all subreddits).

Archiving. https://www.hackerneue.com/item?id=36254172

It's very much "he who has the power to destroy a thing controls that thing"; so long as there's only one canonical copy of a post and it's on Reddit's servers, that's vulnerable to actions by the subreddit mods or Reddit themselves.
