Transformer Architecture

---
type: concept
title: "Softmax Temperature"
tags: ["attention", "softmax", "hyperparameter", "training-stability"]
related: ["Self-Attention", "Key-Query-Value Projection"]
created: 2025-01-01
source: "https://arxiv.org/abs/1706.03762"
---

Softmax Temperature

Summary

Softmax temperature is a scalar parameter that controls the sharpness of a softmax distribution. In the context of Transformers, it is most prominent as the scaling factor $\frac{1}{\sqrt{d_k}}$ in scaled dot-product Self-Attention, where it prevents attention distributions from becoming too peaked and causing vanishing gradients.

How It Works

The softmax function applied to a vector $z$ with temperature $T$ is:

$$\text{softmax}(z / T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

  • Low temperature ($T \to 0$): Distribution becomes sharper, approaching a one-hot vector. The model attends almost exclusively to the highest-scoring token.
  • High temperature ($T \to \infty$): Distribution becomes more uniform. Attention is spread across all tokens.
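The effect of $T$ can be checked in a few lines of plain Python (a minimal sketch; the example scores are arbitrary):

```python
import math

def softmax(z, T=1.0):
    """Softmax with temperature T: lower T sharpens, higher T flattens."""
    exps = [math.exp(x / T) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]
sharp = softmax(scores, T=0.1)   # nearly one-hot: almost all mass on the max score
flat = softmax(scores, T=10.0)   # nearly uniform: mass spread across all entries
```

With $T = 1$ this reduces to the standard softmax.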

In the attention formula, scaling by $\frac{1}{\sqrt{d_k}}$ is equivalent to dividing scores by temperature $T = \sqrt{d_k}$. Without this scaling, the dot product of two $d_k$-dimensional vectors with zero-mean, unit-variance components has variance approximately $d_k$, so raw scores grow with dimension and push the softmax into its saturated region, where gradients become vanishingly small for large $d_k$.
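The variance claim can be verified with a quick Monte Carlo check (a sketch; sample count and seed are arbitrary):

```python
import random

# Empirically check Var(q . k) ~= d_k for zero-mean, unit-variance components,
# which motivates dividing attention scores by sqrt(d_k).
random.seed(0)
d_k = 64
dots = []
for _ in range(5000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(a * b for a, b in zip(q, k)))

mean = sum(dots) / len(dots)
var = sum((x - mean) ** 2 for x in dots) / len(dots)  # close to d_k = 64
scaled_var = var / d_k  # after dividing scores by sqrt(d_k), variance is ~1
```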

Role in the Transformer

In scaled dot-product attention, $d_k = 64$ (base model), so the scale factor is $\frac{1}{8}$. This scaling is applied to the raw QK dot product scores before the softmax, keeping gradients stable throughout training. It is a fixed, non-learned parameter in the standard Transformer.
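A minimal single-head sketch of where this scaling sits in the computation (no masking or learned projections; the shapes here are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, single head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # divide by temperature T = sqrt(d_k)
    # Numerically stable softmax over each row of scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))  # 4 queries, d_k = 64 as in the base model
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a proper probability distribution over the 4 keys.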

Softmax temperature also appears at inference time as a sampling hyperparameter: by adjusting the temperature applied to the final output logits, one can control the diversity vs. determinism of generated tokens.
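A sketch of logit-temperature sampling at inference time (the logits and seed here are arbitrary illustration values):

```python
import math
import random

def sample_with_temperature(logits, T=1.0, seed=None):
    """Sample a token index after scaling logits by 1/T (inverse-CDF sampling)."""
    rng = random.Random(seed)
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [4.0, 1.0, 0.5]
# Very low T is near-greedy: almost always picks the argmax (index 0).
token = sample_with_temperature(logits, T=0.05, seed=0)
```

Higher `T` trades determinism for diversity by flattening the sampling distribution.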

Variants

  • Fixed scaling ($\frac{1}{\sqrt{d_k}}$): Used in the original Transformer and nearly all attention implementations.
  • Learned temperature: Some architectures learn a per-head or global temperature parameter.
  • Inference temperature: Applied to output logits during generation to control randomness. Distinct from attention temperature.

Key Papers

  • Attention Is All You Need — introduced the $\sqrt{d_k}$ scaling and motivated it by the variance analysis of dot products.

Notes

  • The temperature analogy comes from statistical mechanics (Boltzmann distribution). Lower temperature = more "confident" distribution.
  • Attention temperature and output generation temperature are conceptually related but serve different roles and are tuned independently.