Transformer Architecture

Key-Query-Value Projection

The three learned linear projections (Query, Key, Value) that transform input representations into the distinct roles required for scaled dot-product attention.


type: component
title: "Key-Query-Value Projection"
tags: ["attention", "transformer", "component", "projection"]
part_of: ["Transformer"]
created: 2025-01-01


Purpose

The Key-Query-Value (KQV) projection transforms input token representations into three distinct roles — Query, Key, and Value — that are used in the Self-Attention computation. These projections enable the attention mechanism to separate "what to look for" (Query), "what to match against" (Key), and "what to return" (Value).
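
For reference, the projected matrices are combined by the scaled dot-product attention formula of the original Transformer paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$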

Inputs and Outputs

  • Input: Token embedding or hidden state matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$.
  • Outputs:
    • $Q = X W^Q \in \mathbb{R}^{n \times d_k}$ — Query matrix
    • $K = X W^K \in \mathbb{R}^{n \times d_k}$ — Key matrix
    • $V = X W^V \in \mathbb{R}^{n \times d_v}$ — Value matrix
  • In the base Transformer: $d_{\text{model}} = 512$, $h = 8$ heads, and $d_k = d_v = 64$ per head.

Implementation

# Simplified pseudocode for a single attention head
def kqv_projection(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model)
    # W_Q, W_K, W_V: (d_model, d_k)
    Q = X @ W_Q  # (seq_len, d_k)
    K = X @ W_K  # (seq_len, d_k)
    V = X @ W_V  # (seq_len, d_v)
    return Q, K, V
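
Although written as pseudocode, the function above runs as-is on NumPy arrays; a quick shape check with the base-Transformer sizes (the random inputs below are purely illustrative):

import numpy as np

X = np.random.randn(10, 512)  # 10 tokens, d_model = 512
W_Q, W_K, W_V = (np.random.randn(512, 64) for _ in range(3))  # d_k = d_v = 64
Q, K, V = kqv_projection(X, W_Q, W_K, W_V)
print(Q.shape, K.shape, V.shape)  # (10, 64) (10, 64) (10, 64)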

For Multi-Head Attention, each head maintains its own independent set of weight matrices $W^Q_i$, $W^K_i$, $W^V_i$, trained to project into different subspaces.
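
In practice, the per-head matrices are usually stored as one fused $d_{\text{model}} \times d_{\text{model}}$ weight per projection and split into heads after the multiply. The sketch below shows that pattern in NumPy (the function and helper names are illustrative, not from the original paper):

import numpy as np

def multi_head_kqv(X, W_Q, W_K, W_V, n_heads):
    # X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_model), fused over all heads
    seq_len, d_model = X.shape
    d_k = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_k)
        return M.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)
    return Q, K, V  # each: (n_heads, seq_len, d_k)

Slicing each fused weight column-wise by head recovers the independent per-head matrices $W^Q_i$, $W^K_i$, $W^V_i$.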

Design Choices and Hyperparameters

  • Projection dimension ($d_k$, $d_v$): Set to $d_{\text{model}} / h$ in the original paper so that total computation stays roughly constant regardless of head count (a quick parameter count is given after this list).
  • Separate projections for Q, K, V: The asymmetry between the Query and Key projections lets the model learn different criteria for "what to look for" than for "what to match against". Sharing a single projection for Q and K would force the pre-softmax score matrix $Q K^\top$ to be symmetric.
  • No activation function: The projections are purely linear. Non-linearity enters through the softmax and the downstream Feed-Forward Network.
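
As a quick check of the constant-cost claim: with $d_k = d_v = d_{\text{model}} / h$, each projection uses $h \cdot d_{\text{model}} \cdot (d_{\text{model}} / h) = d_{\text{model}}^2$ parameters in total, and the projection matrix-multiply cost scales the same way, independent of $h$. For the base configuration this is $512 \times 512 = 262{,}144$ parameters per projection whether $h = 8$ or $h = 16$.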

Related Concepts

  • Self-Attention — uses Q, K, V to compute attention weights and output representations.
  • Softmax Temperature — the scaling by $\sqrt{d_k}$ in the attention formula compensates for the growth in dot-product magnitude with larger projection dimensions: for query and key components that are independent with zero mean and unit variance, $q \cdot k$ has variance $d_k$, so dividing by $\sqrt{d_k}$ keeps the logits at unit scale.

Notes

  • The Query/Key/Value framing is an analogy to information retrieval: a query is matched against keys to retrieve associated values.
  • Per head, the value projection $W^V$ and the corresponding slice of the output projection $W^O$ in Multi-Head Attention compose into a low-rank factorization of a single larger projection; the same low-rank structure is exploited by parameter-efficient methods such as LoRA adapters, which fine-tune the projections via low-rank updates.
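
To make the last note concrete, here is a minimal sketch of a LoRA-style low-rank update applied to the query projection (the $\alpha / r$ scaling follows the LoRA formulation; function and parameter names are illustrative):

import numpy as np

def lora_query_projection(X, W_Q, A, B, alpha=16.0):
    # W_Q: frozen pretrained projection, (d_model, d_k)
    # A: (d_model, r) and B: (r, d_k) with rank r << d_model; only A and B are trained
    r = A.shape[1]
    delta_W = (alpha / r) * (A @ B)  # low-rank update, same shape as W_Q
    return X @ (W_Q + delta_W)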