Transformer Architecture

Masked Self-Attention

Masked self-attention restricts each token to attend only to itself and prior positions by setting future-position scores to -∞ before the softmax, so those positions receive zero attention weight, enforcing the causal constraint required for autoregressive language modeling.


type: concept
title: "Masked Self-Attention"
tags: [attention, decoder, autoregressive, causal]
related: ["Self-Attention", "Multi-Head Attention", "GPT-2", "Transformer"]
created: 2025-01-01
source: "https://jalammar.github.io/illustrated-gpt2/"

Masked Self-Attention

Summary

Masked self-attention is a variant of Self-Attention used in decoder blocks that prevents each token position from attending to any future (rightward) token positions, enforcing the autoregressive property required for causal language modeling.

How It Works

Masked self-attention follows the same Query-Key-Value computation as standard self-attention, but introduces a masking step before the softmax:

  1. Compute the scores matrix by multiplying the queries matrix by the transposed keys matrix and scaling by 1/sqrt(d_k), as in standard self-attention.
  2. Apply an upper-triangular attention mask: set all cells representing future positions to -∞ (or a very large negative number, e.g., -1e9 in GPT-2).
  3. Apply softmax row-wise. Because exp(-∞) = 0, masked positions contribute zero weight.
  4. Multiply the resulting score matrix by the values matrix to produce the output.

AttentionMask[i][j] = 0    if j <= i   (position j is present or past)
AttentionMask[i][j] = -∞  if j > i    (position j is in the future)

Scores = softmax((Q @ K^T / sqrt(d_k)) + AttentionMask)
Output = Scores @ V

The mask has zeros on and below the diagonal and -∞ everywhere above it (i.e., only the upper triangle is masked). After softmax, each token can only attend to itself and all tokens that precede it.
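
The full computation can be condensed into a few lines of NumPy. This is a minimal, single-head sketch under assumed shapes (one sequence, no batching, no learned projections), not GPT-2's actual implementation; the function name masked_attention is chosen here just for illustration.

import numpy as np

def masked_attention(Q, K, V):
    # Single-head masked self-attention for one sequence.
    # Q, K, V: arrays of shape (seq_len, d_k); returns (seq_len, d_k).
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                  # raw scores, (seq_len, seq_len)
    # Additive mask: -inf strictly above the diagonal blocks future positions.
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    scores = scores + mask
    # Row-wise softmax; exp(-inf) = 0, so masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Because the mask is added before the softmax, each row of weights sums to 1 over the allowed (present and past) positions only.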

KV Caching during inference: Because each new token only needs to attend to previously processed tokens, a practical optimization is to cache the key and value vectors computed for each prior token at each layer. On each new generation step, the query, key, and value vectors are computed only for the newest token; its key and value are appended to the cache, and its query is scored against all cached keys and used to weight the cached values. This avoids redundant recomputation and is used in GPT-2 at evaluation time.
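
One decoding step with a KV cache might look like the sketch below. This is a hypothetical single-head, single-layer illustration (the helper generate_step and the cache layout are assumptions made for this note, not GPT-2's code).

import numpy as np

def generate_step(q_new, k_new, v_new, kv_cache):
    # q_new, k_new, v_new: (d_k,) vectors projected from the newly added token.
    # kv_cache: dict of lists holding keys/values for all earlier tokens.
    kv_cache["K"].append(k_new)                      # cache the new token's key...
    kv_cache["V"].append(v_new)                      # ...and value for later steps
    K = np.stack(kv_cache["K"])                      # (t, d_k)
    V = np.stack(kv_cache["V"])                      # (t, d_k)
    # The newest token may attend to every cached position (all are past or self),
    # so no explicit mask is needed for this single row of attention.
    scores = K @ q_new / np.sqrt(q_new.shape[-1])    # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                               # attention output, (d_k,)

Only one query row is scored per step, so the per-token cost stays linear in the current sequence length instead of rebuilding the full (t × t) score matrix.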

Role in the Transformer

Masked self-attention is the first sub-layer in every decoder block of the original Transformer and in every block of decoder-only architectures like GPT-2. It replaces the bidirectional self-attention found in encoder blocks. The mask implements the causal constraint: a model predicting token t must not have access to tokens t+1, t+2, ....
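
This constraint can be checked empirically with the masked_attention sketch above (a simplified check that passes the token vectors directly as Q, K, and V rather than through learned projections): perturbing a future token leaves the outputs at all earlier positions unchanged.

rng = np.random.default_rng(0)
X1 = rng.standard_normal((5, 8))
X2 = X1.copy()
X2[4] = rng.standard_normal(8)          # change only the last (future) token

out1 = masked_attention(X1, X1, X1)
out2 = masked_attention(X2, X2, X2)
# Positions 0..3 never attend to position 4, so their outputs are identical.
assert np.allclose(out1[:4], out2[:4])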

In the full encoder-decoder Transformer, the decoder uses masked self-attention in its first sub-layer and then standard (unmasked) cross-attention in its second sub-layer to attend over encoder outputs. In decoder-only architectures, the cross-attention sub-layer is dropped entirely.

Variants

  • Standard (bidirectional) Self-Attention: No mask applied; each position can attend to all positions. Used in encoder blocks and models like BERT.
  • Sliding window / local attention masks: Restrict attention to a local neighborhood rather than all past tokens, used in long-context efficient attention variants.
  • Prefix masking: A hybrid where a prefix segment is fully visible (unmasked) and only the continuation is causally masked, used in some encoder-decoder-style decoder-only setups; construction of each mask pattern is sketched after this list.
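
These variants differ only in how the additive mask is built; the rest of the attention computation is unchanged. A small sketch, using the same NumPy conventions as above (the helper names and parameters here are hypothetical):

import numpy as np

def causal_mask(n):
    # Standard causal mask: -inf strictly above the diagonal.
    return np.triu(np.full((n, n), -np.inf), k=1)

def sliding_window_mask(n, window):
    # Causal mask that also blocks positions more than `window` tokens back.
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = -np.inf
    return m

def prefix_mask(n, prefix_len):
    # Prefix masking: the first `prefix_len` tokens attend bidirectionally;
    # the continuation remains causally masked.
    m = causal_mask(n)
    m[:prefix_len, :prefix_len] = 0.0
    return m

Any of these can be substituted for the causal AttentionMask in the formula above.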

Key Papers