---
type: concept
title: "ALiBi"
tags: [position, encoding, attention, extrapolation]
related: ["Positional Encoding", "Rotary Position Embedding", "Self-Attention", "Context Window"]
created: 2023-01-27
source: "https://arxiv.org/abs/2108.12409"
---

ALiBi

Summary

Attention with Linear Biases (ALiBi; Press et al., 2022) replaces explicit positional encodings with a fixed linear distance penalty added to every query-key attention score, enabling Transformers trained on short contexts to extrapolate to significantly longer sequences at inference time.

How It Works

Instead of encoding position in the input embeddings or in Q/K rotations, ALiBi adds a scalar bias to each attention score that is proportional to the distance between the query and key tokens:

$$ \text{softmax}\!\left(q_i K^\top + m \cdot \left[-(i-1), \ldots, -2, -1, 0\right]\right) $$

where $m$ is a head-specific, non-learned scalar that controls the slope of the penalty, and the bias vector assigns $0$ to the key at the query's own position and $-(i-1)$ to the most distant key. For $h$ attention heads, the slopes form a fixed geometric sequence; for example, with 8 heads:

$$ m = \frac{1}{2},\; \frac{1}{2^2},\; \frac{1}{2^3},\; \ldots,\; \frac{1}{2^8} $$

This introduces a strong recency preference: keys far from the query receive a larger negative bias, making distant tokens harder to attend to. Different heads use different slopes, giving the model a built-in mixture of short- and long-range preferences even though no slope is learned.
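As a concrete illustration, the sketch below builds the per-head slopes and the additive bias matrix in PyTorch. It is a minimal sketch rather than the authors' reference code: the function names (`alibi_slopes`, `build_alibi_bias`) are invented for this note, and the slope formula assumes a power-of-two head count as in the 8-head example above (the paper gives a slightly more involved recipe for other head counts).

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Head-specific slopes: the geometric sequence 1/2^1, 1/2^2, ..., 1/2^h
    # from the example above (assumes num_heads is a power of two).
    return torch.tensor([2.0 ** -(h + 1) for h in range(num_heads)])

def build_alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Returns a (heads, seq_len, seq_len) additive bias: entry [h, i, j] = m_h * (j - i),
    # i.e. 0 at the query's own position and -m_h * (i - j) for earlier keys.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]          # (L, L); <= 0 for keys before the query
    # Positive entries (j > i) correspond to future keys and are removed by the
    # usual causal mask, so only the non-positive part of the bias matters.
    slopes = alibi_slopes(num_heads)                # (H,)
    return slopes[:, None, None] * distance[None, :, :]
```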

Role in the Transformer

ALiBi is applied at every attention layer, modifying the pre-softmax attention logits. It requires no additional parameters and no changes to the embedding layer. Because the bias is additive and defined for any distance, the model is not constrained to the training context length at inference.
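To make the "no extra parameters, any length" point concrete, here is a sketch of how the bias slots into standard scaled dot-product attention. It reuses `build_alibi_bias` from the sketch above; the shapes and the causal-mask handling are illustrative assumptions, not a prescribed API.

```python
import torch
import torch.nn.functional as F

def attention_with_alibi(q, k, v, bias):
    # q, k, v: (batch, heads, seq_len, head_dim); bias: (heads, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # pre-softmax logits
    scores = scores + bias                                    # additive ALiBi penalty
    causal = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    return F.softmax(scores + causal, dim=-1) @ v             # standard causal attention

# The bias depends only on query-key distance, so the same attention function runs
# unchanged at a longer context than was seen in training, e.g. a model trained at
# 1,024 tokens evaluated at 2,048:
q = k = v = torch.randn(1, 8, 2048, 64)
bias = build_alibi_bias(num_heads=8, seq_len=2048)
out = attention_with_alibi(q, k, v, bias)                     # (1, 8, 2048, 64)
```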

Variants

  • Rotary Position Embedding (RoPE) — injects relative position via rotation of Q/K rather than additive bias; see Rotary Position Embedding.
  • DA-Transformer — uses a learnable multiplicative distance-based weighting function rather than a fixed additive bias; described in Positional Encoding variants.
  • Sinusoidal / Learned absolute — input-level only; cannot extrapolate; see Positional Encoding.

Key Papers

  • Press et al. (2022), "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (arXiv:2108.12409)

Notes

Press et al. (2022) demonstrated that a 1.3B parameter model trained on context length 1,024 could extrapolate to 2,048 tokens at inference time with ALiBi, matching the perplexity of a sinusoidal baseline trained directly on 2,048, whereas sinusoidal and RoPE encodings degraded almost immediately past their training length and T5-style relative bias degraded after only modest extrapolation. The non-learned slopes are a deliberate design choice: fixing them avoids the risk of the model learning slopes that do not generalise.