Positional Encoding
type: concept
title: "Positional Encoding"
tags: [attention, position, encoding, transformer-basics]
related: ["Self-Attention", "Transformer", "Rotary Position Embedding", "ALiBi", "Transformer-XL"]
created: 2023-01-27
source: "https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/"
Summary
Positional encoding injects sequence-order information into Transformer inputs, compensating for the permutation-invariant nature of Self-Attention. Multiple strategies exist, ranging from fixed sinusoidal functions to learned vectors to relative and rotary formulations applied at every attention layer.
How It Works
The positional encoding matrix $P \in \mathbb{R}^{L \times d}$ has the same shape as the input embedding matrix, so it can be added directly to the token embeddings.
Sinusoidal Positional Encoding
For token position $i = 1, \ldots, L$ and dimension $\delta = 1, \ldots, d$:
$$ \text{PE}(i, \delta) = \begin{cases} \sin\left(\dfrac{i}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' \\ \cos\left(\dfrac{i}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' + 1 \end{cases} $$
Each dimension corresponds to a sinusoid of a different wavelength, ranging from $2\pi$ to $10000 \cdot 2\pi$.
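A minimal NumPy sketch of this encoding (0-indexed positions and dimensions, as in most implementations, and assuming an even $d$; the helper name is illustrative, not from the source post):

```python
import numpy as np

def sinusoidal_encoding(L: int, d: int) -> np.ndarray:
    """Return the L x d sinusoidal positional encoding matrix P (d assumed even)."""
    pos = np.arange(L)[:, None]                    # token positions i
    two_dp = np.arange(0, d, 2)[None, :]           # even dimensions 2 * delta'
    angles = pos / np.power(10000.0, two_dp / d)   # i / 10000^(2 delta' / d)
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)                    # even dimensions use sine
    P[:, 1::2] = np.cos(angles)                    # odd dimensions use cosine
    return P

# Usage: x = token_embeddings + sinusoidal_encoding(L, d)
```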
Learned Positional Encoding
Each position is assigned a learned column vector encoding its absolute position (Gehring et al., 2017). The encoding can additionally be learned differently per layer (Al-Rfou et al., 2018). GPT-2 uses this variant with one trainable vector per position up to 1,024.
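A minimal PyTorch sketch of this variant, assuming GPT-2-small sizes ($d = 768$, maximum context 1,024); the variable names are illustrative:

```python
import torch
import torch.nn as nn

max_len, d = 1024, 768                      # GPT-2-small-style sizes (assumption)
pos_emb = nn.Embedding(max_len, d)          # one trainable vector per position

tokens = torch.randn(8, 128, d)             # (batch, seq_len, d) token embeddings
positions = torch.arange(tokens.size(1))    # 0 .. seq_len - 1
x = tokens + pos_emb(positions)             # broadcasts over the batch dimension
```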
Relative Position Encoding
Shaw et al. (2018) incorporated relative positional information directly into the $W^k$ and $W^v$ projections. Relative positions are clipped to a maximum absolute value of $k$, producing $2k+1$ unique edge labels with learnable representations $P^k, P^v \in \mathbb{R}^{2k+1}$:
$$A^k_{ij} = P^k_{\text{clip}(j-i,\,k)}, \quad A^v_{ij} = P^v_{\text{clip}(j-i,\,k)}$$
where $\text{clip}(x, k) = \text{clip}(x, -k, k)$. Clipping enables generalisation to unseen sequence lengths.
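A small NumPy sketch of the clipped lookup indices (the learned rows of $P^k$ and $P^v$ would then be gathered with these labels; `relative_labels` is an illustrative name):

```python
import numpy as np

def relative_labels(L: int, k: int) -> np.ndarray:
    """Map each (i, j) pair to one of the 2k+1 edge labels in [0, 2k]."""
    offsets = np.arange(L)[None, :] - np.arange(L)[:, None]   # pairwise j - i
    return np.clip(offsets, -k, k) + k                        # shift to non-negative

# A^k would then be gathered as Pk[relative_labels(L, k)], shape (L, L, d).
```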
Transformer-XL Relative Encoding
Transformer-XL reparameterises the dot-product attention score between query $i$ and key $j$ into four interpretable terms:
$$ a^{\text{rel}}_{ij} = \underbrace{x_i W^q W_{E_k}^\top x_j^\top}_{\text{content-based addressing}} + \underbrace{x_i W^q W_{R_k}^\top r_{i-j}^\top}_{\text{content-dependent positional bias}} + \underbrace{u\, W_{E_k}^\top x_j^\top}_{\text{global content bias}} + \underbrace{v\, W_{R_k}^\top r_{i-j}^\top}_{\text{global positional bias}} $$
where $r_{i-j}$ is a sinusoidal relative position encoding, $u$ and $v$ are trainable vectors, and $W^k$ is split into content key matrix $W_{E_k}$ and location key matrix $W_{R_k}$.
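An illustrative (non-vectorised) sketch of the four terms for a single query/key pair, with hypothetical argument names; real implementations batch this over all $(i, j)$ and use a relative-shift trick rather than materialising every $r_{i-j}$:

```python
import numpy as np

def txl_rel_score(x_i, x_j, r_ij, Wq, W_Ek, W_Rk, u, v):
    """Four-term Transformer-XL relative attention score for one query/key pair.

    x_i, x_j : token embeddings, shape (d,); r_ij : sinusoidal encoding of offset i - j;
    Wq, W_Ek, W_Rk : (d, d) projection matrices; u, v : trainable (d,) bias vectors.
    """
    q = x_i @ Wq
    return (q @ W_Ek.T @ x_j        # content-based addressing
            + q @ W_Rk.T @ r_ij     # content-dependent positional bias
            + u @ W_Ek.T @ x_j      # global content bias
            + v @ W_Rk.T @ r_ij)    # global positional bias
```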
Role in the Transformer
Positional encoding is applied once to the input token embeddings before the first Transformer block. In more advanced variants (RoPE, ALiBi, Transformer-XL relative encoding), positional information is injected at every attention layer instead of only at the input.
Variants
- Sinusoidal — fixed, no learned parameters, used in the original Transformer.
- Learned absolute — trainable per-position vectors; used in GPT-2.
- Relative (Shaw et al.) — encodes pairwise distance in key/value projections.
- Transformer-XL relative — reparameterised four-term decomposition enabling cross-segment coherence; see Transformer-XL.
- Rotary Position Embedding (RoPE) — rotates Q/K matrices by position-proportional angles so that inner products depend only on relative position; see Rotary Position Embedding and the rotation sketch after this list.
- ALiBi — adds a fixed linear distance penalty to attention scores at every layer, facilitating length extrapolation; see ALiBi and the bias sketch after this list.
- Distance-Aware Transformer (DA-Transformer) — multiplies attention scores by a learnable distance-based weighting function per head.
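To make the RoPE entry concrete, here is a minimal NumPy sketch of the rotation (the `rope` helper name and the interleaved-pair convention are illustrative assumptions, not the source's code); applying it to both queries and keys makes $q_i^\top k_j$ depend only on $i - j$:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate interleaved feature pairs of x, shape (L, d) with d even,
    by angles proportional to each row's position."""
    L, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per pair
    theta = np.arange(L)[:, None] * freqs[None, :]   # angle grows with position
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # the two halves of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[:, 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out
```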
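Similarly, a minimal sketch of the ALiBi penalty for the causal case; `alibi_bias` is a hypothetical helper, and the slope schedule follows the paper's geometric sequence for power-of-two head counts:

```python
import numpy as np

def alibi_bias(L: int, n_heads: int) -> np.ndarray:
    """Per-head linear distance penalties to add to attention scores."""
    # Head-specific slopes 2^(-8/n), 2^(-16/n), ... (paper's schedule for
    # power-of-two head counts).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    distance = np.arange(L)[None, :] - np.arange(L)[:, None]   # j - i
    # Penalise queries in proportion to how far back key j sits; future
    # positions (j > i) are clipped to 0 and masked out anyway.
    return slopes[:, None, None] * np.minimum(distance, 0)     # (heads, L, L)

# scores = q @ k.T / sqrt(d_head) + alibi_bias(L, n_heads), then mask + softmax.
```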
Key Papers
- Vaswani et al. (2017). "Attention Is All You Need." Introduced sinusoidal encoding.
- Gehring et al. (2017). "Convolutional Sequence to Sequence Learning." Learned absolute encoding.
- Al-Rfou et al. (2018). "Character-Level Language Modeling with Deeper Self-Attention." Per-layer learned encoding.
- Shaw et al. (2018). "Self-Attention with Relative Position Representations."
- Dai et al. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context."
- Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding."
- Press et al. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ALiBi.
- Wu et al. (2021). "DA-Transformer: Distance-aware Transformer."
Notes
A key motivation for moving from absolute to relative encoding is length extrapolation: models trained on short contexts should generalise to longer ones at inference time. ALiBi and RoPE are the leading approaches as of early 2023. Sinusoidal and learned absolute encodings both fail to extrapolate cleanly beyond their training length.