Key-Query-Value Projection
The three learned linear projections (Query, Key, Value) that transform input representations into the distinct roles required for scaled dot-product attention.
type: component
title: "Key-Query-Value Projection"
tags: ["attention", "transformer", "component", "projection"]
part_of: ["Transformer"]
created: 2025-01-01
Purpose
The Key-Query-Value (KQV) projection transforms input token representations into three distinct roles — Query, Key, and Value — that are used in the Self-Attention computation. These projections enable the attention mechanism to separate "what to look for" (Query), "what to match against" (Key), and "what to return" (Value).
Inputs and Outputs
- Input: Token embedding or hidden state matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$.
- Outputs:
- $Q = X W^Q \in \mathbb{R}^{n \times d_k}$ — Query matrix
- $K = X W^K \in \mathbb{R}^{n \times d_k}$ — Key matrix
- $V = X W^V \in \mathbb{R}^{n \times d_v}$ — Value matrix
- In the base Transformer: $d_k = d_v = 64$ per head, $d_{\text{model}} = 512$.
Implementation
```python
import numpy as np

def kqv_projection(X, W_Q, W_K, W_V):
    """Project inputs into Query, Key, and Value matrices for a single head."""
    # X: (seq_len, d_model)
    # W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v)
    Q = X @ W_Q  # (seq_len, d_k)
    K = X @ W_K  # (seq_len, d_k)
    V = X @ W_V  # (seq_len, d_v)
    return Q, K, V
```
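A quick shape check using the base-Transformer dimensions listed above; the random matrices here are illustrative placeholders, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 10, 512, 64  # base-Transformer sizes, single head

X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = kqv_projection(X, W_Q, W_K, W_V)
assert Q.shape == K.shape == V.shape == (n, d_k)
```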
For Multi-Head Attention, each head maintains its own independent set of weight matrices $W^Q_i$, $W^K_i$, $W^V_i$, trained to project into different subspaces.
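In practice, implementations commonly fuse the per-head matrices into one weight and split after a single matmul. The sketch below assumes $d_k = d_v = d_{\text{model}} / h$ and a column-wise packing convention; `W_QKV` is a name introduced here for illustration, not from the paper:

```python
import numpy as np

def multi_head_kqv(X, W_QKV, h):
    """Fused projection for h heads: one matmul, then split per head.
    W_QKV packs all W^Q_i, W^K_i, W^V_i column-wise: (d_model, 3 * d_model)."""
    n, d_model = X.shape
    d_k = d_model // h
    QKV = X @ W_QKV                      # (n, 3 * d_model)
    Q, K, V = np.split(QKV, 3, axis=-1)  # each (n, d_model)
    # Reshape to (h, n, d_k) so each head operates in its own subspace
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    return split(Q), split(K), split(V)
```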
Design Choices and Hyperparameters
- Projection dimension ($d_k$, $d_v$): Set to $d_{\text{model}} / h$ in the original paper to keep total computation roughly constant regardless of head count.
- Separate projections for Q, K, V: The asymmetry between Query and Key lets the model learn matching criteria (what a token looks for) that differ from retrieval criteria (what a token exposes to be matched against). Sharing one projection for Q and K would force the pre-softmax score matrix to be symmetric (see the sketch after this list).
- No activation function: The projections are purely linear. Non-linearity enters through the softmax and the downstream Feed-Forward Network.
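The symmetry point in the second bullet is easy to verify numerically; a minimal sketch with random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 32))
W = rng.standard_normal((32, 8))      # shared projection: W_Q = W_K = W

scores = (X @ W) @ (X @ W).T          # pre-softmax attention scores
assert np.allclose(scores, scores.T)  # tied Q/K projections => symmetric scores
```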
Related Concepts
- Self-Attention — uses Q, K, V to compute attention weights and output representations.
- Softmax Temperature — the $\sqrt{d_k}$ scaling in the attention formula (reproduced below) compensates for dot-product magnitudes growing with the projection dimension.
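For reference, the scaled dot-product attention formula from the original Transformer paper, which consumes these projections:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$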
Notes
- The Query/Key/Value framing is an analogy to information retrieval: a query is matched against keys to retrieve associated values.
- The value projection $W^V_i$ and the output projection $W^O$ in Multi-Head Attention together act as a low-rank factorization of a larger $d_{\text{model}} \times d_{\text{model}}$ map (made explicit below); the same low-rank structure is exploited by parameter-efficient methods such as LoRA adapters.
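One way to state that decomposition explicitly, writing $W^O_i \in \mathbb{R}^{d_v \times d_{\text{model}}}$ for the rows of $W^O$ belonging to head $i$, and $A_i$ for that head's attention-weight matrix (notation introduced here):

$$\mathrm{head}_i = A_i \, X W^V_i \quad\Rightarrow\quad \mathrm{MHA}(X) = \sum_{i=1}^{h} A_i \, X \underbrace{W^V_i W^O_i}_{\mathrm{rank} \,\le\, d_v}$$

Each head therefore moves information through a map of rank at most $d_v = 64$, far below $d_{\text{model}} = 512$ in the base configuration.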