---
type: component
title: "Feed-Forward Network"
tags: [component, transformer, mlp]
part_of: ["Transformer"]
created: 2025-01-01
---
Feed-Forward Network
The Feed-Forward Network is a two-layer MLP applied independently at each sequence position within a Transformer block. It expands the representation to 4× the model dimension before projecting back, providing a non-linear per-position transformation after attention.
Purpose
The Feed-Forward Network (FFN) provides a non-linear, position-wise transformation within each Transformer block. Applied independently to each token position after attention has mixed information across positions, it acts as the model's per-position "processing" layer where learned features are combined and transformed.
Inputs and Outputs
- Input: a tensor of shape (seq_len, d_model), the output of the attention sub-layer (after residual addition and Layer Normalization).
- Output: a tensor of the same shape (seq_len, d_model), the result of two linear transformations with a non-linearity in between.
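In the notation of the original paper, FFN(x) = max(0, x W1 + b1) W2 + b2; GPT-2 replaces the ReLU max(0, ·) with GELU.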
Implementation
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward_network(x, W1, b1, W2, b2):
    # Layer 1: expand to the inner dimension, 4 * d_model
    h = gelu(x @ W1 + b1)  # (seq_len, 4 * d_model)
    # Layer 2: project back to d_model
    out = h @ W2 + b2      # (seq_len, d_model)
    return out
The inner (hidden) dimension is 4× the model dimension. For GPT-2 small (d_model=768), this yields an inner dimension of 3,072. For the original Transformer (d_model=512), the inner dimension is 2,048. The original Transformer used ReLU as the activation; GPT-2 uses GELU.
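As a quick shape check, here is a minimal usage sketch with randomly initialized weights at the GPT-2 small sizes (the sequence length of 16 is arbitrary):

d_model, d_ff, seq_len = 768, 3072, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0.0, 0.02, (d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
out = feed_forward_network(x, W1, b1, W2, b2)
assert out.shape == (seq_len, d_model)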
Each position is processed independently with the same weights — the FFN is shared across positions but each position's vector is transformed in isolation.
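This independence is easy to verify numerically: feeding a single position through the FFN on its own matches the corresponding row of the full-sequence output (reusing x and the weights from the sketch above):

full = feed_forward_network(x, W1, b1, W2, b2)
single = feed_forward_network(x[3:4], W1, b1, W2, b2)  # position 3 in isolation
assert np.allclose(full[3:4], single)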
Design Choices and Hyperparameters
- Inner dimension multiplier (×4): Established in the original Attention Is All You Need paper and retained in GPT-2. Empirically found to provide sufficient representational capacity.
- Activation function: ReLU in the original Transformer; GELU in GPT-2 and most subsequent models. GELU provides a smoother non-linearity with better empirical performance on language tasks.
- Bias vectors: Both linear layers include learned bias terms (though these are sometimes omitted in more recent architectures).
- Parameter count: For GPT-2 small, the two weight matrices are (768×3072) and (3072×768), contributing approximately 4.7M parameters per block, times 12 blocks; the arithmetic is spelled out after this list.
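The per-block figure follows directly from the shapes; this short calculation (biases included) reproduces it:

d_model, d_ff, n_blocks = 768, 3072, 12
per_block = 2 * d_model * d_ff + d_ff + d_model  # two weight matrices plus both bias vectors
print(f"{per_block:,}")             # 4,722,432, i.e. ~4.7M per block
print(f"{per_block * n_blocks:,}")  # 56,669,184 across the 12 blocks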
Related Concepts
- Self-Attention: The sub-layer that precedes the FFN in each Transformer block.
- Layer Normalization: Applied before (pre-norm) or after (post-norm) the FFN sub-layer.
Notes
- The FFN is the largest single parameter contributor in a standard Transformer block — larger than the attention weight matrices for typical d_model values.
- The expansion-then-contraction pattern (d_model → 4×d_model → d_model) is sometimes interpreted as the model using the expanded space to represent a richer set of intermediate features before projecting back.
- In GPT-2, the FFN follows a pre-norm convention (Layer Normalization is applied to the input before each sub-layer), differing from the post-norm convention in the original Transformer paper; the sketch below contrasts the two.
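A minimal sketch contrasting the two conventions, reusing feed_forward_network from the Implementation section. The layer_norm helper below is an illustrative stand-in, not GPT-2's actual module layout:

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each position's vector to zero mean and unit variance, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ffn_sublayer_pre_norm(x, ln, ffn):
    # GPT-2 style: normalize first, transform, then add the residual
    return x + feed_forward_network(layer_norm(x, *ln), *ffn)

def ffn_sublayer_post_norm(x, ln, ffn):
    # Original Transformer style: transform, add the residual, then normalize
    return layer_norm(x + feed_forward_network(x, *ffn), *ln)

Here ln packs (gamma, beta) and ffn packs (W1, b1, W2, b2).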