---
type: concept
title: "Softmax Temperature"
tags: ["attention", "softmax", "hyperparameter", "training-stability"]
related: ["Self-Attention", "Key-Query-Value Projection"]
created: 2025-01-01
source: "https://arxiv.org/abs/1706.03762"
---
Softmax Temperature
Summary
Softmax temperature is a scalar parameter that controls the sharpness of a softmax distribution. In the context of Transformers, it is most prominent as the scaling factor $\frac{1}{\sqrt{d_k}}$ in scaled dot-product Self-Attention, where it prevents attention distributions from becoming too peaked and causing vanishing gradients.
How It Works
The softmax function applied to a vector $z$ with temperature $T$ is:
$$\text{softmax}(z / T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
- Low temperature ($T \to 0$): Distribution becomes sharper, approaching a one-hot vector. The model attends almost exclusively to the highest-scoring token.
- High temperature ($T \to \infty$): Distribution becomes more uniform. Attention is spread across all tokens.
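The effect of the two limits can be seen in a quick numerical sketch (the helper function and score values are illustrative, not from the source):

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T, computed stably by subtracting the max."""
    scaled = np.asarray(z, dtype=float) / T
    scaled -= scaled.max()  # stability shift; leaves the result unchanged
    e = np.exp(scaled)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])
print(softmax_with_temperature(scores, T=0.1))    # near one-hot on the top score
print(softmax_with_temperature(scores, T=1.0))    # moderate spread
print(softmax_with_temperature(scores, T=100.0))  # near uniform
```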
In the attention formula, dividing the scores by $\sqrt{d_k}$ is equivalent to applying a temperature $T = \sqrt{d_k}$. Without it, the dot product of two $d_k$-dimensional vectors with zero-mean, unit-variance components has variance $d_k$, so for large $d_k$ the softmax saturates into regions with near-zero gradients.
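The variance claim can be checked empirically (a sketch; $d_k$ and the sample count are arbitrary choices, with components drawn i.i.d. standard normal):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 256, 20_000

# Independent query/key vectors with zero-mean, unit-variance components
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = np.einsum('ij,ij->i', q, k)  # n raw dot products

print(dots.var())                   # ~ d_k
print((dots / np.sqrt(d_k)).var())  # ~ 1 after the 1/sqrt(d_k) scaling
```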
Role in the Transformer
In scaled dot-product attention, $d_k = 64$ (base model), so the scale factor is $\frac{1}{8}$. This scaling is applied to the raw QK dot product scores before the softmax, keeping gradients stable throughout training. It is a fixed, non-learned parameter in the standard Transformer.
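A minimal single-head sketch of where the scaling sits in the computation (shapes and values are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling applied before the softmax; T = sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))  # d_k = 64, scale 1/8
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64)
```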
Softmax temperature also appears at inference time as a sampling hyperparameter: by adjusting the temperature applied to the final output logits, one can control the diversity vs. determinism of generated tokens.
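Sampling with an output temperature can be sketched as follows (the logit values and the helper name are made up for illustration):

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    """Divide logits by T, softmax, then draw one token index."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # stability shift
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.0, 0.2, -1.0])
low = [sample_with_temperature(logits, 0.2, rng) for _ in range(100)]
high = [sample_with_temperature(logits, 5.0, rng) for _ in range(100)]
print(len(set(low)), len(set(high)))  # low T: fewer distinct tokens than high T
```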
Variants
- Fixed scaling ($\frac{1}{\sqrt{d_k}}$): Used in the original Transformer and nearly all attention implementations.
- Learned temperature: Some architectures learn a per-head or global temperature parameter.
- Inference temperature: Applied to output logits during generation to control randomness. Distinct from attention temperature.
Key Papers
- Attention Is All You Need — introduced the $\sqrt{d_k}$ scaling and motivated it by the variance analysis of dot products.
Notes
- The temperature analogy comes from statistical mechanics (Boltzmann distribution). Lower temperature = more "confident" distribution.
- Attention temperature and output generation temperature are conceptually related but serve different roles and are tuned independently.