Just did a test on GPT-4:

Me: The following text on the next line has a secret word inside it. What is that word?

xxxxxxxxxEGGxxxxxxxxxx

GPT-4: The secret word inside the text is "EGG".

I'm with you, though; I thought things were tokenized! But this example clearly shows that's not the case.
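
For anyone who wants to rerun the test above programmatically, here is a minimal sketch using the OpenAI Python client (v1-style API; it assumes an OPENAI_API_KEY in the environment, with the prompt copied from the comment):

    # Sketch only: re-running the "secret word" test against GPT-4 via the
    # OpenAI Python client (openai>=1.0). Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()
    prompt = ("The following text on the next line has a secret word inside it. "
              "What is that word?\n\nxxxxxxxxxEGGxxxxxxxxxx")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)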


It is, though not how you'd expect. OpenAI have a tool that lets you see how text is tokenized: https://platform.openai.com/tokenizer

This only has GPT-3 for now, but I imagine results are similar. "xxxxxxxxxEGGxxxxxxxxxx" gets tokenized as [xxxxxxxx][x][EG][G][xxxxxxxx][xx], so I could see how it could 'see' the secret word.
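
You can also check the split locally with OpenAI's tiktoken package instead of the web tool. A rough sketch; the encoding name "r50k_base" is my guess for the GPT-3 tokenizer that page shows, and other encodings may split the string differently:

    # Sketch: print the decoded piece for each token of the test string.
    # "r50k_base" is an assumption for the GPT-3 encoding.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    ids = enc.encode("xxxxxxxxxEGGxxxxxxxxxx")
    print([enc.decode([i]) for i in ids])  # one decoded string per token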

Even the 4-bit quantized LLaMA 13B tuned with RLHF-LoRA on the Alpaca dataset got it right.

    sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.750000, repeat_last_n = 64, repeat_penalty = 1.000000


    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Each line in the following has a secret word inside it. Find the secret word in each line.
    1. eeeeeeggeeeee
    2. eeeeeeggggggg
    3. eeeeeeyeeeeee
    4. eeeeeeyeyeeee

    ### Response:
    1. egg
    2. egg
    3. eye
    4. eye [end of text]

I was very curious and checked with this string, "eeeeeeggeeeee", which should be tokenized as [eeee][ee][g][ge][eeee]. Both GPT-3.5 and GPT-4 gave me "egg", which is a single token.
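
The same kind of check can be run against the tokenizer GPT-3.5/GPT-4 actually use, via tiktoken's model lookup. Again, this is a sketch of mine rather than the commenter's run:

    # Sketch: how the GPT-3.5/GPT-4 tokenizer splits the strings above.
    # encoding_for_model maps a model name to its encoding (cl100k_base here).
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    for s in ["eeeeeeggeeeee", "egg"]:
        ids = enc.encode(s)
        print(s, "->", [enc.decode([i]) for i in ids])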

Ahh this is really interesting! Thanks for sharing.
