sdenton4
You get embeddings at every activation layer of the network, at every token. That's extra state accessible to the network when running in recurrent 'generate the next token' mode.
How much extra state and computation is it per token exactly? Can we account for the improvement in just those terms?
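A rough back-of-the-envelope sketch, with made-up numbers for a hypothetical mid-sized transformer (none of these figures come from the paper; they're placeholders to show the accounting):

    # Rough accounting of the extra state and compute one appended
    # token buys, for a hypothetical config (all numbers illustrative).
    n_layers = 24      # transformer blocks
    d_model = 1024     # activation width per token per layer
    bytes_per = 2      # fp16

    # Each extra token carries one d_model-wide activation per layer,
    # plus cached key/value projections that later tokens can attend to.
    state_floats = n_layers * d_model
    kv_floats = n_layers * 2 * d_model  # keys and values per layer

    print(f"activation state per token: {state_floats:,} floats "
          f"(~{state_floats * bytes_per / 1024:.0f} KiB fp16)")
    print(f"KV-cache per token:         {kv_floats:,} floats "
          f"(~{kv_floats * bytes_per / 1024:.0f} KiB fp16)")

    # Compute: a forward pass costs roughly 2 * n_params FLOPs per
    # token, so each filler token buys about that much extra compute.
    n_params = 350e6   # hypothetical parameter count
    print(f"extra compute per token:    ~{2 * n_params:.2e} FLOPs")

So the extra state per token is linear in depth times width, and the extra compute is roughly one forward pass's worth.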
That's the point of this paper: investigating whether 'chain of thought' prompting kinda-works because it actually induces reasoning, or whether it's just that more verbose answers give the model more tokens to work with, and thus more state in which to hide interesting computations. This work introduces a way to give the model more tokens - and thus compute and state - independent of the prompt, which makes it easier to separate the computational impact of verbosity from that of the prompting itself.
Basically, in a chain of N tokens, the state of a token at layer L reflects on the order of L * N / 2 intermediary states' worth of info. (In practice a lot less, since attention is a somewhat noisy channel between tokens.)
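A toy count of that triangle, under the idealized assumption that each token's state at layer l can attend to every earlier token's state at layer l-1 (an idealization, since real attention is sparse and noisy):

    # Count intermediate states upstream of each token's state,
    # assuming full causal attention at every layer (idealized).
    N, L = 32, 24  # chain length, layer count (illustrative)

    # Token at position n, layer l, can read the layer-(l-1) states
    # of tokens 1..n, so its receptive field grows like l * n.
    upstream = [[l * n for n in range(1, N + 1)] for l in range(1, L + 1)]

    avg = sum(upstream[-1]) / N  # average over positions, top layer
    print(f"last token, top layer: {upstream[-1][-1]} states (= L*N = {L * N})")
    print(f"position-averaged:     {avg:.0f} states (~ L*N/2 = {L * N / 2:.0f})")

The last token sees the full L * N triangle edge; averaging over positions gives the ~L * N / 2 figure.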