type: concept
title: "Decoder-Only Transformer"
tags: [architecture-pattern, decoder, autoregressive, language-modeling]
related: ["Transformer", "Masked Self-Attention", "GPT-2", "Self-Attention"]
created: 2025-01-01
source: "https://jalammar.github.io/illustrated-gpt2/"

Decoder-Only Transformer

The decoder-only Transformer removes the encoder and cross-attention from the original architecture, leaving a stack of masked self-attention and feed-forward blocks suited to autoregressive language modeling.
Summary
The decoder-only Transformer is an architectural pattern that retains only the decoder stack of the original encoder-decoder Transformer, discarding the encoder and the cross-attention sub-layer. It is the foundational design for autoregressive language models including GPT-2 and its successors.
How It Works
Each block in a decoder-only Transformer contains exactly two sub-layers (the full decoder block has three; the cross-attention sub-layer over encoder outputs is the one removed):
- Masked Self-Attention sub-layer: Allows each position to attend to all prior positions (and itself) but not future ones.
- Feed-Forward Network sub-layer: A position-wise two-layer MLP that transforms each token's representation independently.
Both sub-layers use residual connections and Layer Normalization.
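A minimal sketch of one such block, assuming PyTorch and a post-norm layout; the dimensions (d_model, n_heads, d_ff) are illustrative rather than taken from any particular model:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder-only block: masked self-attention + position-wise FFN,
    each wrapped in a residual connection followed by LayerNorm."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True marks the future positions a query must NOT attend to.
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        # Masked self-attention sub-layer, then residual connection + LayerNorm.
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        x = self.ln1(x + attn_out)
        # Position-wise feed-forward sub-layer, then residual connection + LayerNorm.
        return self.ln2(x + self.ff(x))
```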
The input to the first block is the sum of the token Embedding and a Positional Encoding vector. The output of the final block is projected through a weight matrix (typically the token embedding matrix reused in transposed form, i.e. weight tying) to produce Logits over the vocabulary, which softmax converts into a probability distribution.
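A short sketch of those input and output ends, assuming learned positional embeddings (as GPT-2 uses) and an output projection tied to the token embedding matrix; the sizes and token ids are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50257, 768, 1024   # GPT-2-small-like sizes, for illustration
tok_emb = nn.Embedding(vocab_size, d_model)       # token embedding matrix
pos_emb = nn.Embedding(max_len, d_model)          # learned positional embeddings

token_ids = torch.tensor([[464, 3290, 318]])                  # (batch, seq_len) of token ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)      # 0, 1, 2, ...
x = tok_emb(token_ids) + pos_emb(positions)                   # input to the first block

h = x  # stand-in for the hidden states after the stack of decoder blocks

logits = h @ tok_emb.weight.T               # project with the transposed embedding matrix
probs = torch.softmax(logits, dim=-1)       # probability distribution over the vocabulary
```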
At inference time, the model operates autoregressively: each forward pass produces one new token, which is appended to the sequence before the model runs again. In practice, KV caching stores the keys and values of earlier positions so their representations are not recomputed on every step.
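A greedy decoding sketch of that loop, assuming a model callable that maps token ids to logits of shape (batch, seq_len, vocab); here the full sequence is re-run each step, which is exactly the recomputation a KV cache avoids:

```python
import torch

def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 20) -> torch.Tensor:
    """Greedy autoregressive generation: one new token per forward pass."""
    ids = prompt_ids                                   # (1, prompt_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = model(ids)                            # re-runs all positions; a KV cache would skip this
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick from the last position only
        ids = torch.cat([ids, next_id], dim=1)         # append and continue
    return ids
```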
Role in the Transformer
The decoder-only pattern is a specialization of the full Transformer designed for causal (left-to-right) language modeling. It sacrifices bidirectional context (as found in encoder-only models like BERT) in exchange for the ability to generate text token-by-token without requiring a separate encoder input.
The architectural lineage is:
- Original encoder-decoder Transformer (2017) — full encoder + full decoder.
- "Generating Wikipedia by Summarizing Long Sequences" (Liu et al., 2018) — first prominent decoder-only stack, 6 blocks, 4,000 token context.
- "Character-Level Language Modeling with Deeper Self-Attention" (Al-Rfou et al., 2018) — deeper decoder-only stack at character level.
- GPT-2 (2019) — large-scale decoder-only model, up to 48 blocks, 1,024-token context.
Variants
- Encoder-only (e.g., BERT): Uses bidirectional self-attention; cannot generate autoregressively; suited for classification and understanding tasks.
- Encoder-decoder (e.g., T5): Uses both stacks with cross-attention; suited for sequence-to-sequence tasks like translation.
- Decoder-only with prefix (non-causal prefix): A hybrid where a prompt prefix is processed with full attention and only the continuation is causally masked.
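A small sketch contrasting the two masking schemes, assuming a boolean matrix where True means "query row may attend to key column"; the sequence and prefix lengths are illustrative:

```python
import torch

seq_len, prefix_len = 6, 3

# Fully causal mask: position i attends to positions 0..i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Prefix-LM mask: the prompt prefix attends bidirectionally within itself,
# while the continuation stays causally masked.
prefix_lm = causal.clone()
prefix_lm[:prefix_len, :prefix_len] = True
```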