---
type: concept
title: "Layer Normalization"
tags: [normalization, training-stability, transformer]
related: ["Transformer", "Feed-Forward Network", "Multi-Head Attention", "GPT-2"]
created: 2025-01-01
source: "https://arxiv.org/abs/1607.06450"
---

Layer Normalization

Summary

Layer Normalization standardizes activations across the feature dimension of each token independently, stabilizing the training of deep stacks of Transformer blocks without any dependence on batch size or sequence length.

How It Works

For an input vector x of dimension d_model, Layer Normalization computes:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma + \epsilon} + \beta$$

where μ and σ are the mean and standard deviation computed across the d_model features of that single token (not across the batch or sequence), ε is a small constant for numerical stability, and γ and β are learned per-feature scale and shift parameters. (Implementations such as PyTorch's nn.LayerNorm add ε to the variance inside the square root rather than to σ directly.)

This means every token's representation is normalized independently, making the operation insensitive to batch size — an important property for language models trained with variable sequence lengths.
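The per-token computation can be sketched in NumPy (a minimal illustration; the parameter names `gamma`, `beta`, and the ε value 1e-5 are assumptions, not from the source):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector independently.

    x: (..., d_model) -- the last axis is the feature dimension.
    gamma, beta: (d_model,) learned scale and shift.
    """
    mu = x.mean(axis=-1, keepdims=True)        # per-token mean
    var = x.var(axis=-1, keepdims=True)        # per-token variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # standardize features
    return gamma * x_hat + beta                # learned affine transform

# Each token (last-axis vector) is normalized on its own,
# so batch size and sequence length are irrelevant.
x = np.random.randn(2, 4, 8)                   # (batch, seq, d_model)
d_model = x.shape[-1]
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
print(np.allclose(y.mean(-1), 0, atol=1e-6))   # → True
```

Because the statistics are computed over the last axis only, the same function works for any batch and sequence shape.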

Role in the Transformer

Layer Normalization is applied around each sub-layer (multi-head attention and the feed-forward network) of every Transformer block, placed either before or after the sub-layer computation on the residual stream:

  • Post-norm (original Attention Is All You Need convention): LayerNorm(x + Sublayer(x))
  • Pre-norm (used in GPT-2 and most modern models): x + Sublayer(LayerNorm(x))

The pre-norm arrangement has been found empirically to stabilize training of very deep models and is the convention used in GPT-2.
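The two conventions above differ only in where normalization sits relative to the residual connection. A minimal sketch (the `sublayer` function is a hypothetical stand-in for attention or the feed-forward network, and learned γ/β are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain per-token normalization (learned gamma/beta omitted).
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer(x):
    # Stand-in for attention or the feed-forward network.
    return x @ np.full((x.shape[-1], x.shape[-1]), 0.1)

def post_norm_block(x):
    # Original "Attention Is All You Need" ordering:
    # normalize AFTER adding the residual.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # GPT-2 ordering: normalize the input to the sub-layer; the
    # residual path itself is never normalized, so gradients flow
    # through the skip connection unchanged.
    return x + sublayer(layer_norm(x))

x = np.random.randn(4, 8)
print(post_norm_block(x).shape, pre_norm_block(x).shape)
```

Note that in the pre-norm block the identity path from input to output is untouched by normalization, which is one intuition for its improved training stability in deep stacks.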

Layer Normalization is applied pervasively throughout transformer architectures and plays a critical role in making deep stacks (12–48+ blocks) trainable.

Variants

  • Batch Normalization: Normalizes across the batch dimension; not suitable for variable-length sequences or small batches.
  • RMS Norm (Root Mean Square Normalization): Omits the mean-centering step, using only the RMS of activations; used in LLaMA and other recent models for efficiency.
  • Group Normalization: Normalizes over subsets of the feature dimension; less common in Transformers.
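The RMS Norm variant listed above can be sketched to show the omitted centering step (a minimal illustration; the learned scale `g` and ε value are assumptions):

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: no mean subtraction -- divide by the root mean
    # square of the features only, then apply the learned scale.
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return g * x / rms

x = np.random.randn(3, 8)
y = rms_norm(x, np.ones(8))
# Each row now has RMS ≈ 1, but its mean is NOT forced to zero,
# which saves the mean computation relative to LayerNorm.
print(np.allclose(np.sqrt((y ** 2).mean(-1)), 1, atol=1e-3))  # → True
```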

Key Papers