blackbear_
While the theoretical bottleneck is there, it is far less restrictive than what you are describing, because the number of almost-orthogonal vectors grows exponentially with the ambient dimensionality. And near-orthogonality is what matters for telling different vectors apart: since any distribution can be expressed as a mixture of Gaussians, the number of separate concepts you can encode with such a mixture also grows exponentially.
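A quick numerical illustration of the packing claim (toy sizes, with random unit vectors standing in for whatever directions a model actually learns): as the dimension grows, a fixed number of random vectors becomes closer and closer to mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pairwise_cosine(n, d):
    # Sample n random unit vectors in R^d and return the largest absolute
    # cosine similarity between any two distinct vectors.
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    gram = np.abs(v @ v.T)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

for d in (16, 64, 256, 1024):
    print(f"d={d}: worst overlap among 2000 random vectors ~ {max_pairwise_cosine(2000, d):.2f}")
```

The worst-case overlap shrinks roughly like sqrt(log n / d), which is the flip side of the statement that the number of vectors you can pack below any fixed overlap threshold grows exponentially in d.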
I agree that you can encode any single concept and that the encoding space of a single top pick grows exponentially.
However, I'm talking about the probability distribution over tokens.
Within the framework of "almost-orthogonal axes", can't you still create a vector that has the desired mix of projections onto any combination of these axes?
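One rough way to probe this numerically (made-up sizes, with random unit vectors as a stand-in for the almost-orthogonal axes): ask least squares for a single vector whose projections onto n given axes match arbitrary targets, and watch what happens once n exceeds the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # ambient dimension (arbitrary choice)

def relative_residual(n_axes):
    # n_axes random unit directions in R^d; for large d these are
    # almost orthogonal to each other.
    axes = rng.standard_normal((n_axes, d))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    target = rng.standard_normal(n_axes)  # desired projections onto each axis
    # Best single vector x with axes @ x as close to target as possible.
    x, *_ = np.linalg.lstsq(axes, target, rcond=None)
    return np.linalg.norm(axes @ x - target) / np.linalg.norm(target)

for n_axes in (64, 128, 512, 2048):
    print(f"{n_axes} axes in {d} dims: relative error {relative_residual(n_axes):.2f}")
```

In this toy setup the answer depends on how many axes you try to control: up to d of them can be matched essentially exactly, but with many more axes than dimensions an arbitrary combination of projections generally cannot be realized by a single vector.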
No. You can fit an exponential number of almost-orthogonal vectors into the input space, but the number of not-too-similar probability distributions over output tokens is exponential in the output dimension, which is far larger than the hidden dimension. This is fine if you only care about a small subset of distributions (e.g. those that assign significant probability to at most k tokens), but if you pick a random distribution from the full simplex, it's unlikely to be represented well. Fortunately, this doesn't seem to be much of an issue in practice, and people even do top-k sampling intentionally.
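A toy sketch of that last point (random "unembedding" matrix and made-up sizes, so it only illustrates the geometry, not a trained model): gradient descent can drive KL(target || softmax(W h)) to essentially zero when the target already lies in the d-dimensional family the softmax can express, but not for an arbitrary dense distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5000, 64  # made-up sizes; real models differ

W = rng.standard_normal((vocab, hidden)) / np.sqrt(hidden)  # toy "unembedding"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_kl(target, steps=3000, lr=0.5):
    # Minimize KL(target || softmax(W h)) over h by gradient descent;
    # the gradient is W^T (softmax(W h) - target) and the objective is convex in h.
    h = np.zeros(hidden)
    for _ in range(steps):
        h -= lr * W.T @ (softmax(W @ h) - target)
    q = softmax(W @ h)
    return float(np.sum(target * np.log(target / q)))

# A target that lies in the d-dimensional family the softmax can express...
reachable = softmax(W @ rng.standard_normal(hidden))
# ...versus an arbitrary dense distribution over the whole vocabulary.
arbitrary = rng.dirichlet(np.ones(vocab))

print("reachable target, best KL:", best_kl(reachable))  # should be ~0
print("arbitrary target, best KL:", best_kl(arbitrary))  # stays well above 0
```

The achievable distributions form only a d-dimensional slice of the (vocab - 1)-dimensional simplex, so an arbitrary point in the simplex usually cannot be matched, while anything close to that slice (including the sharply peaked, top-k-like distributions the model itself tends to produce) is fine.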