Transformer Architecture

Decoder-Only Transformer

The decoder-only Transformer removes the encoder and the cross-attention sub-layer from the original architecture, leaving a stack of blocks that pair masked self-attention with a feed-forward network, a design suited to autoregressive language modeling.


type: concept
title: "Decoder-Only Transformer"
tags: [architecture-pattern, decoder, autoregressive, language-modeling]
related: ["Transformer", "Masked Self-Attention", "GPT-2", "Self-Attention"]
created: 2025-01-01
source: "https://jalammar.github.io/illustrated-gpt2/"


Summary

The decoder-only Transformer is an architectural pattern that retains only the decoder stack of the original encoder-decoder Transformer, discarding the encoder and the cross-attention sub-layer. It is the foundational design for autoregressive language models including GPT-2 and its successors.

How It Works

Each block in a decoder-only Transformer contains exactly two sub-layers (compared to three in the full decoder):

  1. Masked Self-Attention sub-layer: Allows each position to attend to all prior positions (and itself) but not future ones.
  2. Feed-Forward Network sub-layer: A position-wise two-layer MLP that transforms each token's representation independently.

Both sub-layers use residual connections and Layer Normalization.
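The two sub-layers can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not a full implementation: the parameter names (`Wq`, `Wk`, `Wv`, `W1`, `b1`, `W2`, `b2`) are hypothetical, and it uses post-layer-norm placement as in the original paper (GPT-2 actually moves LayerNorm before each sub-layer).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model). Single head for clarity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: position i may attend to positions <= i only.
    n = x.shape[0]
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -1e9
    return softmax(scores) @ v

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise two-layer MLP, applied to each token independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def decoder_block(x, p):
    # Sub-layer 1: masked self-attention with residual + LayerNorm.
    x = layer_norm(x + masked_self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: feed-forward network with residual + LayerNorm.
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x
```

Because of the mask, perturbing a later token cannot change the block's output at earlier positions, which is exactly the property autoregressive training relies on.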

The input to the first block is the sum of the token Embedding and a Positional Encoding vector. The output of the final block is projected through a weight matrix (shared with or transposed from the embedding matrix) to produce Logits over the vocabulary, which are converted to a probability distribution via softmax.
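The input and output ends of the stack can be illustrated as follows. This is a toy sketch with made-up sizes (`vocab`, `d_model`) and a random embedding matrix; the decoder blocks themselves are elided, and the output head reuses the transposed embedding matrix (weight tying).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sinusoidal_positions(n, d):
    # Fixed sinusoidal positional encodings from the original paper.
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

vocab, d_model, seq = 100, 16, 4
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab, d_model))   # token embedding matrix

tokens = np.array([5, 17, 42, 3])
# Input to the first block: token embedding + positional encoding.
x = W_embed[tokens] + sinusoidal_positions(seq, d_model)

# ... the decoder blocks would transform x here ...

# Output head: project through the transposed embedding matrix.
logits = x @ W_embed.T        # (seq, vocab)
probs = softmax(logits)       # next-token distribution at each position
```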

At inference time, the model operates autoregressively: it produces one token per forward pass, appends it to the sequence, and runs again. More efficiently, a KV cache stores the keys and values of prior tokens so their representations are not recomputed at each step.
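The KV-caching idea can be sketched for a single attention head: each step computes query, key, and value only for the new token, appends the key and value to the cache, and attends over everything cached so far. The function name and cache layout here are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cached_attention_step(x_new, cache, Wq, Wk, Wv):
    # One generation step: compute q/k/v for the NEW token only,
    # append k and v to the cache, then attend over all cached keys.
    q = x_new @ Wq
    cache["K"].append(x_new @ Wk)
    cache["V"].append(x_new @ Wv)
    K = np.stack(cache["K"])                   # (t, d)
    V = np.stack(cache["V"])                   # (t, d)
    scores = (K @ q) / np.sqrt(K.shape[-1])    # (t,)
    # No causal mask needed: the cache only holds past and current tokens.
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Feed five token representations one at a time, reusing the cache.
xs = rng.normal(size=(5, d))
cache = {"K": [], "V": []}
outs = [cached_attention_step(x, cache, Wq, Wk, Wv) for x in xs]
```

Step `t` of the cached loop produces the same attention output as row `t` of full masked self-attention over the whole sequence, while only ever doing one query's worth of work per step.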

Role in the Transformer

The decoder-only pattern is a specialization of the full Transformer designed for causal (left-to-right) language modeling. It sacrifices bidirectional context (as found in encoder-only models like BERT) in exchange for the ability to generate text token-by-token without requiring a separate encoder input.

The architectural lineage is:

  1. Original encoder-decoder Transformer (2017) — full encoder + full decoder.
  2. "Generating Wikipedia by Summarizing Long Sequences" (Liu et al., 2018) — first prominent decoder-only stack, 6 blocks, 4,000-token context.
  3. "Character-Level Language Modeling with Deeper Self-Attention" (Al-Rfou et al., 2018) — deeper decoder-only stack at character level.
  4. GPT-2 (2019) — large-scale decoder-only model, up to 48 blocks, 1,024-token context.

Variants

  • Encoder-only (e.g., BERT): Uses bidirectional self-attention; cannot generate autoregressively; suited for classification and understanding tasks.
  • Encoder-decoder (e.g., T5): Uses both stacks with cross-attention; suited for sequence-to-sequence tasks like translation.
  • Decoder-only with prefix (non-causal prefix): A hybrid where a prompt prefix is processed with full attention and only the continuation is causally masked.
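The difference between the causal and prefix variants comes down to the attention mask. A minimal sketch of a prefix-LM mask (the function name is illustrative): prefix positions attend bidirectionally among themselves, while continuation positions remain causally masked.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    # Boolean mask, True = attention allowed.
    # Start from the standard causal (lower-triangular) mask...
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # ...then allow full bidirectional attention within the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask
```

For `prefix_lm_mask(5, 2)`, positions 0-1 (the prompt) see each other in both directions, positions 2-4 (the continuation) see only themselves and everything before them, and no prefix position sees the continuation.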

Key Papers