Layer Normalization
Layer Normalization standardizes each token's feature vector independently of the batch and sequence dimensions. It is applied at every sub-layer of a Transformer block to stabilize training; GPT-2 uses a pre-norm variant in which normalization precedes each sub-layer.
type: concept
title: "Layer Normalization"
tags: [normalization, training-stability, transformer]
related: ["Transformer", "Feed-Forward Network", "Multi-Head Attention", "GPT-2"]
created: 2025-01-01
source: "https://arxiv.org/abs/1607.06450"
Summary
Layer Normalization standardizes the activations across the feature dimension within each Transformer sub-layer, stabilizing the training of deep stacks of blocks by reducing internal covariate shift, and doing so without any dependence on batch size.
How It Works
For an input vector x of dimension d_model, Layer Normalization computes:
$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma + \epsilon} + \beta$$
where μ and σ are the mean and standard deviation computed across the d_model features for that single token (not across the batch or sequence), and γ and β are learned scale and shift parameters.
This means every token's representation is normalized independently, making the operation insensitive to batch size — an important property for language models trained with variable sequence lengths.
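As a concrete illustration, here is a minimal NumPy sketch of this per-token normalization. The epsilon value, array shapes, and function name are illustrative assumptions rather than a reference to any particular library implementation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector independently.

    x:     (batch, seq_len, d_model) activations
    gamma: (d_model,) learned scale
    beta:  (d_model,) learned shift
    eps:   small constant for numerical stability (illustrative value)
    """
    mu = x.mean(axis=-1, keepdims=True)      # per-token mean over d_model
    sigma = x.std(axis=-1, keepdims=True)    # per-token std dev over d_model
    return gamma * (x - mu) / (sigma + eps) + beta

# The statistics depend only on the last axis, so batch size and sequence
# length do not affect the result for any individual token.
x = np.random.randn(2, 4, 8)                 # (batch=2, seq=4, d_model=8)
gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(x, gamma, beta)
```

Note that framework implementations such as PyTorch's nn.LayerNorm add epsilon to the variance inside the square root rather than to the standard deviation; the sketch above simply mirrors the formula as written here.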
Role in the Transformer
Layer Normalization appears at each sub-layer of every Transformer block, applied to the residual stream before or after the sub-layer computation:
- Post-norm (original Attention Is All You Need convention): LayerNorm(x + Sublayer(x))
- Pre-norm (used in GPT-2 and most modern models): x + Sublayer(LayerNorm(x))
The pre-norm arrangement has been found empirically to stabilize training of very deep models and is the convention used in GPT-2.
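To make the distinction concrete, the sketch below wraps a generic sub-layer in both conventions. Sublayer stands in for attention or the feed-forward network, and the layer_norm helper from the earlier sketch is assumed; the function names are illustrative.

```python
def post_norm_block(x, sublayer, gamma, beta):
    # Original Transformer: add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm_block(x, sublayer, gamma, beta):
    # GPT-2 style: normalize only the input to the sub-layer; the residual
    # path itself stays untouched, which helps gradients flow through
    # deep stacks of blocks.
    return x + sublayer(layer_norm(x, gamma, beta))
```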
Layer Normalization is applied pervasively throughout transformer architectures and plays a critical role in making deep stacks (12–48+ blocks) trainable.
Variants
- Batch Normalization: Normalizes across the batch dimension; not suitable for variable-length sequences or small batches.
- RMS Norm (Root Mean Square Normalization): Omits the mean-centering step, using only the RMS of activations; used in LLaMA and other recent models for efficiency (see the sketch after this list).
- Group Normalization: Normalizes over subsets of the feature dimension; less common in Transformers.
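For contrast with the layer_norm sketch above, here is a minimal RMS Norm sketch; it reuses NumPy from the earlier example, and the epsilon value is again an illustrative assumption.

```python
def rms_norm(x, gamma, eps=1e-5):
    # Skip mean-centering: rescale each token by the root mean square of its
    # features, then apply the learned gain. The common formulation uses no
    # shift parameter beta.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```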