I speak of temperature specifically in the context of top-p/top-k sampling.
See this Reddit comment for confirmation that my definition is a commonly accepted one: https://old.reddit.com/r/GPT3/comments/qujerp/comment/hkqoqx...
> Temperature defines how likely it is to choose less probable words. T=0 gives the same response every time because there's a 0% chance to choose any word but the most likely. T=1 is the default, it just picks based on the model's base confidence. T>1 gives more weight to unlikely words than to likely ones.
Bonus content:
> This means that a reasonably low p, like 0.8, and high temp will produce quite interesting outputs, because the model will only choose from the most likely words, but won't go for the most most likely. It's perfect for "creative" models, e.g., for writing fiction.
For example, when teaching a child to play rock-paper-scissors, you tell them to act randomly and unpredictably, because the optimal strategy is random.
It is the same with a language model: the optimal behavior involves some randomness. A temperature less than 1 will "widen the gap" between probabilities, so word-choice probabilities of [0.4, 0.6] might sharpen to [0.2, 0.8]. At a temperature of zero they collapse to [0.0, 1.0]: the model always chooses the most likely word and thus becomes deterministic, always giving the same output.
As the temperature approaches 1, the sampling distribution approaches the model's raw output distribution; as you increase it beyond 1, the distribution tends toward uniform, eventually ignoring the model's output entirely.
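A quick sketch of that effect (names here are my own, not from any particular library): applying temperature T to a probability distribution amounts to raising each probability to the power 1/T and renormalizing, which is the same as softmax(log(p)/T).

```python
def apply_temperature(probs, temperature):
    # Raise each probability to 1/temperature, then renormalize.
    # temperature < 1 sharpens the distribution; temperature > 1
    # flattens it toward uniform.
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

print(apply_temperature([0.4, 0.6], 0.5))  # sharper: gap widens
print(apply_temperature([0.4, 0.6], 5.0))  # flatter: closer to [0.5, 0.5]
```

With [0.4, 0.6] and T=0.5 the gap widens (roughly [0.31, 0.69]); with T=5 it shrinks toward uniform.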
As someone who just implemented an LLM token sampler: `probabilities[token] ∝ exp(logits[token] / temperature)`, which is the same as raising the unnormalized probability `exp(logits[token])` to the power `1 / temperature` and renormalizing.
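For concreteness, here is a minimal sampler along those lines (my own sketch, not any specific implementation), with T=0 special-cased as greedy argmax since dividing by zero is otherwise undefined:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    # T = 0: deterministic greedy decoding (pick the argmax).
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by 1/T, then softmax. Subtracting the max
    # before exponentiating is for numerical stability only.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index according to the scaled distribution.
    return random.choices(range(len(logits)), weights=probs, k=1)[0]
```

In practice you would apply top-p/top-k filtering to `probs` before the final draw.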