Just did a test on GPT-4:

Me: The following text on the next line has a secret word inside it. What is that word?

xxxxxxxxxEGGxxxxxxxxxx

GPT-4: The secret word inside the text is "EGG".

I'm with you, though; I thought things were tokenized! But this example clearly shows that's not the case.
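
For anyone who wants to rerun the test above programmatically, here is a minimal sketch using the OpenAI Python client (v1-style API; it assumes an OPENAI_API_KEY in the environment, with the prompt copied from the comment):

    # Sketch only: re-running the "secret word" test against GPT-4 via the
    # OpenAI Python client (openai>=1.0). Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()
    prompt = ("The following text on the next line has a secret word inside it. "
              "What is that word?\n\nxxxxxxxxxEGGxxxxxxxxxx")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)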


It is, though not how you'd expect. OpenAI have a tool that lets you see how text is tokenized: https://platform.openai.com/tokenizer

This only has GPT-3 for now, but I imagine results are similar. "xxxxxxxxxEGGxxxxxxxxxx" gets tokenized as [xxxxxxxx][x][EG][G][xxxxxxxx][xx], so I could see how it could 'see' the secret word.
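
You can also check the split locally with OpenAI's tiktoken package instead of the web tool. A rough sketch; the encoding name "r50k_base" is my guess for the GPT-3 tokenizer that page shows, and other encodings may split the string differently:

    # Sketch: print the decoded piece for each token of the test string.
    # "r50k_base" is an assumption for the GPT-3 encoding.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    ids = enc.encode("xxxxxxxxxEGGxxxxxxxxxx")
    print([enc.decode([i]) for i in ids])  # one decoded string per token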

Even the 4-bit quantized LLaMA 13B tuned with RLHF-LoRA on the Alpaca dataset got it right.

    sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.750000, repeat_last_n = 64, repeat_penalty = 1.000000


    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Each line in the following has a secret word inside it. Find the secret word in each line.
    1. eeeeeeggeeeee
    2. eeeeeeggggggg
    3. eeeeeeyeeeeee
    4. eeeeeeyeyeeee

    ### Response:
    1. egg
    2. egg
    3. eye
    4. eye [end of text]

I was very curious and checked with this string, "eeeeeeggeeeee", which should be tokenized as [eeee][ee][g][ge][eeee]. Both GPT-3.5 and GPT-4 gave me "egg", which is a single token.
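
The same kind of check can be run against the tokenizer GPT-3.5/GPT-4 actually use, via tiktoken's model lookup. Again, this is a sketch of mine rather than the commenter's run:

    # Sketch: how the GPT-3.5/GPT-4 tokenizer splits the strings above.
    # encoding_for_model maps a model name to its encoding (cl100k_base here).
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    for s in ["eeeeeeggeeeee", "egg"]:
        ids = enc.encode(s)
        print(s, "->", [enc.decode([i]) for i in ids])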

Ahh this is really interesting! Thanks for sharing.
