---
type: component
title: "Feed-Forward Network"
tags: [component, transformer, mlp]
part_of: ["Transformer"]
created: 2025-01-01
---
Feed-Forward Network
The Feed-Forward Network is a two-layer MLP applied independently at each sequence position within a Transformer block. It expands the representation to 4× the model dimension before projecting back, providing a non-linear per-position transformation after attention.
Purpose
The Feed-Forward Network (FFN) provides a non-linear, position-wise transformation within each Transformer block. Applied independently to each token position after attention has mixed information across positions, it acts as the model's per-position "processing" layer where learned features are combined and transformed.
Inputs and Outputs
- Input: a tensor of shape (seq_len, d_model), the output of the attention sub-layer (after residual addition and Layer Normalization).
- Output: a tensor of the same shape (seq_len, d_model), the result of two linear transformations with a non-linearity in between.
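In the notation of the original paper, FFN(x) = max(0, x W1 + b1) W2 + b2; GPT-2 replaces the ReLU max(0, ·) with GELU.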
Implementation
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward_network(x, W1, b1, W2, b2):
    # Layer 1: expand to the inner dimension, 4 * d_model
    h = gelu(x @ W1 + b1)  # (seq_len, 4 * d_model)
    # Layer 2: project back to d_model
    out = h @ W2 + b2      # (seq_len, d_model)
    return out
The inner (hidden) dimension is 4× the model dimension. For GPT-2 small (d_model=768), this yields an inner dimension of 3,072. For the original Transformer (d_model=512), the inner dimension is 2,048. The original Transformer used ReLU as the activation; GPT-2 uses GELU.
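As a quick shape check, here is a minimal usage sketch with randomly initialized weights at the GPT-2 small sizes (the sequence length of 16 is arbitrary):

d_model, d_ff, seq_len = 768, 3072, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0.0, 0.02, (d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
out = feed_forward_network(x, W1, b1, W2, b2)
assert out.shape == (seq_len, d_model)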
Each position is processed independently with the same weights — the FFN is shared across positions but each position's vector is transformed in isolation.
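This independence is easy to verify numerically: feeding a single position through the FFN on its own matches the corresponding row of the full-sequence output (reusing x and the weights from the sketch above):

full = feed_forward_network(x, W1, b1, W2, b2)
single = feed_forward_network(x[3:4], W1, b1, W2, b2)  # position 3 in isolation
assert np.allclose(full[3:4], single)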
Design Choices and Hyperparameters
- Inner dimension multiplier (×4): Established in the original Attention Is All You Need paper and retained in GPT-2. Empirically found to provide sufficient representational capacity.
- Activation function: ReLU in the original Transformer; GELU in GPT-2 and most subsequent models. GELU provides a smoother non-linearity with better empirical performance on language tasks.
- Bias vectors: Both linear layers include learned bias terms (though these are sometimes omitted in more recent architectures).
- Parameter count: For GPT-2 small, the two weight matrices are (768×3072) and (3072×768), contributing approximately 4.7M parameters per block, times 12 blocks; the arithmetic is spelled out after this list.
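The per-block figure follows directly from the shapes; this short calculation (biases included) reproduces it:

d_model, d_ff, n_blocks = 768, 3072, 12
per_block = 2 * d_model * d_ff + d_ff + d_model  # two weight matrices plus both bias vectors
print(f"{per_block:,}")             # 4,722,432, i.e. ~4.7M per block
print(f"{per_block * n_blocks:,}")  # 56,669,184 across the 12 blocks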
Related Concepts
- Self-Attention: The sub-layer that precedes the FFN in each Transformer block.
- Layer Normalization: Applied before (pre-norm) or after (post-norm) the FFN sub-layer.
Notes
- The FFN is the largest single parameter contributor in a standard Transformer block — larger than the attention weight matrices for typical d_model values.
- The expansion-then-contraction pattern (d_model → 4×d_model → d_model) is sometimes interpreted as the model using the expanded space to represent a richer set of intermediate features before projecting back.
- In GPT-2, the FFN follows a pre-norm convention (Layer Normalization is applied to the input before each sub-layer), differing from the post-norm convention in the original Transformer paper; the sketch below contrasts the two.
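A minimal sketch contrasting the two conventions, reusing feed_forward_network from the Implementation section. The layer_norm helper below is an illustrative stand-in, not GPT-2's actual module layout:

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each position's vector to zero mean and unit variance, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ffn_sublayer_pre_norm(x, ln, ffn):
    # GPT-2 style: normalize first, transform, then add the residual
    return x + feed_forward_network(layer_norm(x, *ln), *ffn)

def ffn_sublayer_post_norm(x, ln, ffn):
    # Original Transformer style: transform, add the residual, then normalize
    return layer_norm(x + feed_forward_network(x, *ffn), *ln)

Here ln packs (gamma, beta) and ffn packs (W1, b1, W2, b2).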