Transformer Architecture

type: concept
title: "Self-Attention"
tags: [attention, transformer, core-mechanism]
related: ["Multi-Head Attention", "Key-Query-Value Projection", "Masked Self-Attention", "Softmax Temperature", "Transformer"]
created: 2025-01-01
source: "https://arxiv.org/abs/1706.03762"

Self-Attention

Summary

Self-attention is the mechanism by which each token in a sequence attends to all other tokens via Query, Key, and Value projections, producing a context-aware representation of each token as a weighted sum of all value vectors. It is the core computational primitive of the Transformer.

How It Works

Self-attention is computed in three steps:

Step 1 — Create Q, K, V vectors: Each input token's representation is projected through three learned weight matrices to produce a Query, Key, and Value vector. See Key-Query-Value Projection.
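
In code, Step 1 amounts to three matrix multiplications. Below is a minimal NumPy sketch; the weight matrices are random stand-ins for learned parameters, and the toy dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes for illustration

X = rng.normal(size=(seq_len, d_model))   # one row per input token

# Three learned projection matrices (random stand-ins for trained weights)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each of shape (seq_len, d_k)
```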

Step 2 — Score: The current token's query vector is compared, via dot product, against the key vectors of all tokens in the sequence. The scores are scaled by 1/√d_k (see Softmax Temperature) and passed through a softmax to produce attention weights.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Step 3 — Sum: Each value vector is multiplied by its corresponding attention weight, and the results are summed to produce the output for the current position.
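
Putting the three steps together, the formula above can be sketched directly in NumPy, continuing the toy Q, K, V from Step 1. The optional additive `mask` parameter is an assumption here, included to anticipate the decoder variant discussed below:

```python
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Step 2: scaled dot-product scores
    if mask is not None:
        scores = scores + mask          # -inf entries become zero weights
    weights = softmax(scores)           # one attention distribution per query
    return weights @ V                  # Step 3: weighted sum of value vectors

out = attention(Q, K, V)                # (seq_len, d_k), one output per token
```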

Intuition: The Query represents "what I'm looking for", the Key represents "what I contain", and the Value represents "what I contribute if selected." The analogy to a filing cabinet is useful: the query is a search note, the keys are folder labels, and the values are the folder contents — but instead of retrieving one folder, a weighted blend of all folders is returned.

In an encoder (bidirectional) self-attention block, every token can attend to every other token. In a decoder, Masked Self-Attention restricts each token to attend only to itself and preceding tokens.
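
The causal restriction can be sketched as an additive mask: positions above the diagonal are set to −∞ before the softmax, so each token's attention weights on future tokens come out exactly zero. Using the toy `attention` function above:

```python
# Upper-triangular -inf mask: token i may attend only to positions 0..i
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
decoder_out = attention(Q, K, V, mask=causal_mask)
```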

Role in the Transformer

Self-attention is the first sub-layer in every Transformer encoder block. It enables the model to build context-sensitive representations: the output vector for any given token is informed by all other tokens in the sequence, weighted by learned relevance. This allows long-range dependencies to be captured in a single layer, unlike RNNs, which must propagate information sequentially, one token at a time.

In practice, self-attention is applied in its multi-head form: see Multi-Head Attention.
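
As a rough sketch of that multi-head form, continuing the earlier toy code (the head count, per-head sizes, and the output projection W_o below are illustrative; see Multi-Head Attention for the full picture):

```python
def multi_head_attention(X, n_heads=2):
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Each head has its own smaller Q/K/V projections (random stand-ins)
        Wq = rng.normal(size=(d_model, d_head))
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        head_outputs.append(attention(X @ Wq, X @ Wk, X @ Wv))
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))       # final output projection
    return concat @ W_o

mh_out = multi_head_attention(X)
```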

Variants

  • Masked Self-Attention: Causal variant used in decoder blocks; scores for future positions are set to −∞ before the softmax, so their attention weights become zero.
  • Cross-attention: A generalization where Q comes from one sequence (the decoder) and K, V come from another (the encoder output); see the sketch after this list.
  • Multi-Head Attention: Runs self-attention in parallel across multiple subspaces of Q, K, V.
  • Sparse attention: Restricts which positions can attend to which, reducing the O(n²) complexity.
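
Cross-attention reuses the same machinery; only the sources of Q versus K, V change. A minimal sketch with hypothetical toy tensors, reusing the projection matrices and `attention` function from above:

```python
# Queries come from the decoder sequence; keys/values from the encoder output
dec_len = 3
X_dec = rng.normal(size=(dec_len, d_model))     # decoder-side tokens
enc_out = rng.normal(size=(seq_len, d_model))   # encoder output

Q_dec = X_dec @ W_q
K_enc, V_enc = enc_out @ W_k, enc_out @ W_v
cross_out = attention(Q_dec, K_enc, V_enc)      # (dec_len, d_k)
```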

Key Papers