type: concept
title: "Adaptive Attention Span"
tags: [efficiency, attention, adaptive, sparse]
related: ["Self-Attention", "Sparse Attention", "Multi-Head Attention"]
created: 2023-01-27
source: "https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/"

Adaptive Attention Span

Summary

Adaptive Attention Span (Sukhbaatar et al., 2019) allows each attention head to learn its own optimal context window length, reducing computation by attending only as far back as necessary while preserving model quality.

How It Works

Standard attention for the $i$-th token over a span of size $s$:

$$ a_{ij} = \text{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{r=i-s}^{i-1} \exp(e_{ir})}, \quad y_i = \sum_{r=i-s}^{i-1} a_{ir} x_r W^v $$
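For reference, here is a minimal PyTorch sketch of this fixed-span attention; the function name, the $1/\sqrt{d}$ score scaling, and the NaN handling for the first token are illustrative choices, not the paper's code:

```python
import torch
import torch.nn.functional as F

def fixed_span_attention(q, k, v, span):
    """Single-head attention where token i attends only to tokens i-span .. i-1.

    q, k, v: [seq_len, d] query/key/value matrices (v = x @ W_v).
    """
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                     # e_ij (usual scaled dot product)
    i = torch.arange(seq_len).unsqueeze(1)          # query positions
    j = torch.arange(seq_len).unsqueeze(0)          # key positions
    dist = i - j
    allowed = (dist >= 1) & (dist <= span)          # keys with j in [i-span, i-1]
    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = F.softmax(scores, dim=-1)                # a_ij
    attn = torch.nan_to_num(attn)                   # token 0 has no valid keys
    return attn @ v                                 # y_i = sum_r a_ir v_r
```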

A soft mask function $m_z$ modulates each attention weight based on the distance between query and key, parameterised by a learnable span parameter $z \in [0, s]$:

$$ m_z(x) = \text{clip}\!\left(\frac{1}{R}(R + z - x),\, 0,\, 1\right) $$

where $R$ is a hyperparameter controlling softness. This is applied element-wise before normalisation:

$$ a_{ij} = \frac{m_z(i-j)\,\exp(e_{ij})}{\sum_{r=i-s}^{i-1} m_z(i-r)\,\exp(e_{ir})} $$

Because $m_z$ is differentiable, $z$ is trained jointly with all other model parameters. Each head $i$ learns its own $z^{(i)}$, and an L1 penalty $\sum_{i=1}^{h} z^{(i)}$ added to the training loss encourages shorter spans.
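A hedged PyTorch sketch of the soft mask and the re-normalised weights; `soft_mask`, `adaptive_span_weights`, and keeping $z$ as a scalar per head are illustrative assumptions, not the authors' released implementation:

```python
import torch

def soft_mask(dist, z, R):
    """m_z(x) = clip((R + z - x) / R, 0, 1), applied to query-key distances."""
    return torch.clamp((R + z - dist) / R, min=0.0, max=1.0)

def adaptive_span_weights(scores, z, R, max_span):
    """Re-normalised attention weights a_ij from raw logits e_ij.

    scores: [seq_len, seq_len]; z: scalar learnable span parameter for this head.
    """
    seq_len = scores.size(0)
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    dist = (i - j).float()
    window = ((dist >= 1) & (dist <= max_span)).float()   # hard limit at span s
    m = soft_mask(dist, z, R) * window                     # m_z(i - j)
    # subtract the row max before exp for numerical stability; it cancels in
    # the ratio, so the normalised weights are unchanged
    exp_scores = torch.exp(scores - scores.max(dim=-1, keepdim=True).values)
    weights = m * exp_scores                               # m_z(i-j) * exp(e_ij)
    denom = weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return weights / denom

# z would be one learnable parameter per head, e.g.
#   z = torch.nn.Parameter(torch.tensor(max_span / 2.0))
# kept within [0, max_span] (e.g. by clamping after each update), and the
# L1 regulariser on spans is simply lambda_span * sum of all heads' z values.
```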

Dynamic Span (Adaptive Computation Time extension)

The span parameter can be made input-dependent: $$z_t = S \cdot \sigma(v \cdot x_t + b)$$ where $S$ is the maximum span and $v$ and $b$ are learned jointly with the rest of the model, making the span vary per token as well as per head.
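A small sketch of this dynamic variant, assuming a hypothetical `DynamicSpan` module (the linear layer holds $v$ and $b$, and `max_span` plays the role of $S$):

```python
import torch

class DynamicSpan(torch.nn.Module):
    """Per-token span z_t = S * sigmoid(v . x_t + b)."""

    def __init__(self, d_model, max_span):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, 1)   # holds v (weight) and b (bias)
        self.max_span = max_span                  # S

    def forward(self, x):
        # x: [seq_len, d_model] -> z: [seq_len], one span per token
        return self.max_span * torch.sigmoid(self.proj(x)).squeeze(-1)
```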

Role in the Transformer

Adaptive Attention Span modifies the Self-Attention operation within each Multi-Head Attention head. It does not alter the feed-forward, normalisation, or residual components of the Transformer block.

Variants

  • Fixed sparse patterns (Sparse Transformer) — discrete factorised patterns rather than a continuous learned span; see Sparse Attention.
  • ALiBi — distance bias rather than masking; see ALiBi.

Key Papers

  • Sukhbaatar et al. (2019), "Adaptive Attention Span in Transformers"

Notes

Experiments showed that lower Transformer layers tend to learn short spans while a small number of higher-layer heads learn very long spans. Adaptive Attention Span reduced FLOPs significantly on large models with long context, validating the hypothesis that not all heads need equal reach. The approach also provided interpretability: visualising $z^{(i)}$ per head reveals which heads specialise in long-range vs. local dependencies.