
A temperature of 1 just means using the probabilities directly from the model; you're looking for temperatures above 1 (which increase the probability of less likely tokens).

As someone who just implemented an LLM token sampler: `probabilities[token] ∝ exp(logits[token] / temperature)`, i.e. softmax of the logits divided by the temperature (equivalently, raise the model's probabilities to the power `1 / temperature` and renormalize).
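
For anyone who wants to see it end to end, here is a minimal sketch of that formulation in Python/NumPy (the `sample_token` function and its signature are just illustrative, not from any particular library):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample one token id from raw logits with temperature scaling.

    T == 1 uses the model's distribution as-is, T -> 0 approaches greedy
    (argmax) decoding, and T > 1 flattens the distribution toward uniform.
    """
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))             # greedy / deterministic
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()                          # softmax(logits / T)
    return int(rng.choice(len(probs), p=probs))
```

Dividing by a temperature above 1 shrinks the differences between logits before the softmax, which is exactly what raises the probability of less likely tokens.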


Ah, what does a temperature in the range 0 <= temperature < 1 mean?
Usually that means temperature must be in the range [0, 1) (inclusive of 0, exclusive of 1). There's no technical reason that sampling must be done this way, unless some implementations use a different mathematical definition of temperature (which is possible); a temperature of 1 might mean "totally random/uniform sampling" in that case.

I speak of temperature specifically in the context of top-p/top-k sampling.

See this Reddit comment for confirmation that my definition is a commonly accepted one: https://old.reddit.com/r/GPT3/comments/qujerp/comment/hkqoqx...

> Temperature defines how likely it is to choose less probable words. T=0 gives the same response every time because there's a 0% chance to choose any word but the most likely. T=1 is the default, it just picks based on the model's base confidence. T>1 gives more weight to unlikely words than to likely ones.

Bonus content:

> This means that a reasonably low p, like 0.8, and high temp will produce quite interesting outputs, because the model will only choose from the most likely words, but won't go for the most most likely. It's perfect for "creative" models, e.g., for writing fiction.
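
Roughly, that combination might look like the sketch below (parameter names and the ordering of the temperature and top-p steps are my assumptions; real implementations differ on the ordering):

```python
import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.8, rng=None):
    """Apply temperature, then sample from the smallest set of tokens whose
    cumulative probability reaches top_p (nucleus sampling)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    kept = probs[nucleus] / probs[nucleus].sum()      # renormalize the nucleus
    return int(rng.choice(nucleus, p=kept))
```

The top-p cutoff throws away the long tail of very unlikely tokens, and the high temperature then spreads probability more evenly over the tokens that survive the cutoff.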

Makes sense, thanks!
The optimal solution to a POMDP (partially observable Markov decision process) sometimes requires taking random actions. That is, if we only know part of the state, the optimal course of action often involves acting randomly.

For example, if you instruct a child how to play rock-paper-scissors, you will instruct them to act randomly and unpredictably.

It is the same with a language model: the optimal solution involves some randomness. A temperature less than 1 will "widen the gap": if the word choice probabilities are [0.4, 0.6], they might widen to [0.2, 0.8] with a temperature less than one. When the temperature equals zero, they become [0.0, 1.0]: the model always chooses the most likely word and thus becomes deterministic, giving the same output every time.
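
A quick numeric check of that example, assuming the "raise probabilities to the power 1/temperature and renormalize" formulation from above (a temperature of about 0.3 happens to reproduce roughly [0.2, 0.8]):

```python
import numpy as np

p = np.array([0.4, 0.6])
for T in (1.0, 0.3, 0.05):            # 0.05 stands in for "temperature -> 0"
    w = p ** (1 / T)                  # equivalent to softmax(log(p) / T)
    print(T, np.round(w / w.sum(), 3))
# 1.0  -> [0.4   0.6  ]   (unchanged)
# 0.3  -> [0.206 0.794]   (the gap widens)
# 0.05 -> [0.    1.   ]   (effectively deterministic)
```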

Zero temperature means that the most probable token is produced each time.

The sampled distribution gets closer to the model's output distribution as the temperature approaches one, and as you increase it further you tend toward a uniform distribution (completely ignoring the model's output).
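
A small illustration of that trend (the logits are made-up numbers, chosen only to show the effect):

```python
import numpy as np

logits = np.array([2.0, 0.0, -2.0])
for T in (1.0, 3.0, 100.0):
    p = np.exp(logits / T)
    p /= p.sum()
    print(T, np.round(p, 3))
# 1.0   -> [0.867 0.117 0.016]   (the model's own distribution)
# 3.0   -> [0.563 0.289 0.148]   (flatter)
# 100.0 -> [0.34  0.333 0.327]   (approaching uniform, model output ignored)
```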
