sdenton4
You get embeddings at every activation layer of the network, at every token. That's extra state accessible to the network when running in recurrent 'generate the next token' mode.
How much extra state and computation is it per token exactly? Can we account for the improvement in just those terms?
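A rough back-of-the-envelope sketch, with made-up numbers for a hypothetical mid-sized transformer (none of these figures come from the paper; they're placeholders to show the accounting):

    # Rough accounting of the extra state and compute one appended
    # token buys, for a hypothetical config (all numbers illustrative).
    n_layers = 24      # transformer blocks
    d_model = 1024     # activation width per token per layer
    bytes_per = 2      # fp16

    # Each extra token carries one d_model-wide activation per layer,
    # plus cached key/value projections that later tokens can attend to.
    state_floats = n_layers * d_model
    kv_floats = n_layers * 2 * d_model  # keys and values per layer

    print(f"activation state per token: {state_floats:,} floats "
          f"(~{state_floats * bytes_per / 1024:.0f} KiB fp16)")
    print(f"KV-cache per token:         {kv_floats:,} floats "
          f"(~{kv_floats * bytes_per / 1024:.0f} KiB fp16)")

    # Compute: a forward pass costs roughly 2 * n_params FLOPs per
    # token, so each filler token buys about that much extra compute.
    n_params = 350e6   # hypothetical parameter count
    print(f"extra compute per token:    ~{2 * n_params:.2e} FLOPs")

So the extra state per token is linear in depth times width, and the extra compute is roughly one forward pass's worth.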
That's the point of this paper: investigating whether 'chain of thought' prompting kinda-works because it actually induces reasoning, or whether it's just that more verbose answers give the model more tokens to work with, and thus more state in which to hide interesting computations. This work introduces a way to give the model more tokens - and thus compute and state - independent of the prompt, which makes it easier to separate the computational impact of verbosity from that of the prompting itself.
Basically, in a chain of N tokens, the state of a token at layer L reflects on the order of L * N / 2 intermediary states' worth of info. (In practice a lot less, since attention is a somewhat noisy channel between tokens.)
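A toy count of that triangle, under the idealized assumption that each token's state at layer l can attend to every earlier token's state at layer l-1 (an idealization, since real attention is sparse and noisy):

    # Count intermediate states upstream of each token's state,
    # assuming full causal attention at every layer (idealized).
    N, L = 32, 24  # chain length, layer count (illustrative)

    # Token at position n, layer l, can read the layer-(l-1) states
    # of tokens 1..n, so its receptive field grows like l * n.
    upstream = [[l * n for n in range(1, N + 1)] for l in range(1, L + 1)]

    avg = sum(upstream[-1]) / N  # average over positions, top layer
    print(f"last token, top layer: {upstream[-1][-1]} states (= L*N = {L * N})")
    print(f"position-averaged:     {avg:.0f} states (~ L*N/2 = {L * N / 2:.0f})")

The last token sees the full L * N triangle edge; averaging over positions gives the ~L * N / 2 figure.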