---
type: concept
title: "Rotary Position Embedding"
tags: [position, encoding, attention, rope]
related: ["Positional Encoding", "Self-Attention", "Key-Query-Value Projection", "ALiBi"]
created: 2023-01-27
source: "https://arxiv.org/abs/2104.09864"
---

Rotary Position Embedding

Summary

Rotary Position Embedding (RoPE; Su et al., 2021) encodes absolute position via a rotation matrix applied to Query and Key projections at every attention layer, producing inner products that depend only on the relative offset between positions — combining the strengths of absolute and relative encodings.

How It Works

RoPE frames the problem as follows: given the query at position $i$ and the key at position $j$, design their inner product so that the result is a function only of the token contents and the relative position $i - j$.
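In the notation of Su et al. (2021), this amounts to finding functions $f_q$, $f_k$, and $g$ such that

$$ \langle f_q(x_i, i),\, f_k(x_j, j) \rangle = g(x_i, x_j,\, i - j) $$

for token embeddings $x_i$ and $x_j$; the rotation construction below is one solution to this constraint.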

2D Case

In two dimensions, a counter-clockwise rotation of a vector $z$ by angle $\theta$ is $Rz$, where:

$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
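The key property is that the inner product of two rotated vectors depends only on the difference of their rotation angles. A minimal NumPy check of this (the `rotate2d` helper is ours, for illustration only):

```python
import numpy as np

def rotate2d(z, theta):
    """Rotate a 2D vector z counter-clockwise by angle theta."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ z

q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
theta = 0.1  # rotation angle per position step

# Inner product of rotated vectors depends only on the offset j - i:
a = rotate2d(q, 3 * theta) @ rotate2d(k, 7 * theta)    # offset 4
b = rotate2d(q, 10 * theta) @ rotate2d(k, 14 * theta)  # same offset 4
assert np.isclose(a, b)
```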

General $d$-Dimensional Case

The $d$-dimensional space is divided into $d/2$ independent 2D subspaces. For a token at position $i$, the rotation matrix $R^d_{\Theta,i}$ is a block-diagonal matrix of size $d \times d$:

$$ R^d_{\Theta,i} = \begin{bmatrix} \cos i\theta_1 & -\sin i\theta_1 & 0 & 0 & \cdots \\ \sin i\theta_1 & \cos i\theta_1 & 0 & 0 & \cdots \\ 0 & 0 & \cos i\theta_2 & -\sin i\theta_2 & \cdots \\ 0 & 0 & \sin i\theta_2 & \cos i\theta_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} $$

where $\Theta = \{\theta_k = 10000^{-2(k-1)/d},\; k = 1, \ldots, d/2\}$ is the same set of base frequencies used by sinusoidal encoding.
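In practice the sparse block-diagonal matrix is never materialised; the rotation is applied with elementwise sines and cosines, as in the paper's efficient form. A minimal sketch (assuming the convention that adjacent dimensions are paired; real libraries differ in how they pair dimensions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply RoPE to a vector x (shape [d], d even) at integer position pos."""
    d = x.shape[-1]
    # theta_k = base^{-2(k-1)/d} for k = 1..d/2
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = pos * theta                 # rotation angle in each 2D subspace
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]            # pair adjacent dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin      # 2D rotation within each subspace
    out[1::2] = x1 * sin + x2 * cos
    return out
```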

Applying to Attention

Both Query and Key are multiplied by the rotation matrix for their respective positions:

$$ q_i^\top k_j = (R^d_{\Theta,i} W^q x_i)^\top (R^d_{\Theta,j} W^k x_j) = x_i^\top (W^q)^\top R^d_{\Theta,j-i}\, W^k x_j $$

where $R^d_{\Theta,j-i} = (R^d_{\Theta,i})^\top R^d_{\Theta,j}$ is itself a rotation matrix parameterised by the relative offset $j - i$. This means the attention score depends only on content and relative position, never on absolute position.
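This identity can be checked numerically with the `rope` sketch above (random matrices stand in for the learned projections $W^q$, $W^k$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

def score(i, j):
    """Attention logit between positions i and j after RoPE."""
    return rope(Wq @ x_i, i) @ rope(Wk @ x_j, j)

# Same contents and same offset j - i = 5, different absolute positions:
assert np.isclose(score(2, 7), score(40, 45))
```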

Role in the Transformer

RoPE replaces the input-level positional encoding. It is applied at every attention layer by rotating the Query and Key matrices before the dot-product is computed, leaving Value projections untouched. It does not add any parameters beyond the fixed frequency schedule $\Theta$.
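To make the placement concrete, here is a hedged sketch of a single attention head with the rotation applied to Q and K only (reusing the `rope` helper above; `attn_head` and its signature are illustrative, not any particular library's API):

```python
import numpy as np

def attn_head(X, Wq, Wk, Wv):
    """X: [seq_len, d] token embeddings. RoPE rotates each Q and K row
    by its position before the dot-product; V is left untouched."""
    n, d = X.shape
    Q = np.stack([rope(Wq @ x, pos) for pos, x in enumerate(X)])
    K = np.stack([rope(Wk @ x, pos) for pos, x in enumerate(X)])
    V = X @ Wv.T                              # no positional rotation on values
    scores = Q @ K.T / np.sqrt(d)             # depends only on relative offsets
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # row-wise softmax over keys
    return w @ V                              # [seq_len, d]
```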

Variants

  • ALiBi — an alternative distance-based approach that adds a linear penalty bias to attention scores rather than rotating Q/K; see ALiBi.
  • Sinusoidal / Learned absolute — applied once at input only; see Positional Encoding.
  • Transformer-XL relative encoding — a learnable reparameterisation of relative position in key projections; see Transformer-XL.

Key Papers

  • Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv:2104.09864)

Notes

RoPE has become the dominant positional encoding strategy in post-2022 decoder-only architectures (e.g., LLaMA) because it enables better length extrapolation than learned absolute encodings and integrates cleanly with every attention layer. It reuses the sinusoidal frequency schedule but applies it multiplicatively, as a rotation of Q and K, rather than additively at the input; it is this rotational form that makes the relative-position property fall out naturally from the inner product.