Transformer Architecture

Transformer-XL extends the Transformer with segment-level hidden-state recurrence and relative positional encoding, enabling effective attention across multiple segments without quadratic recomputation.


type: architecture
title: "Transformer-XL"
family: "decoder-only"
introduced_in: "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"
tags: [long-context, recurrence, relative-position, language-modeling]
created: 2023-01-27

Transformer-XL

Summary

Transformer-XL (Dai et al., 2019; "XL" = "extra long") extends the vanilla Transformer for language modeling by introducing a segment-level recurrence mechanism that reuses hidden states from previous segments. This allows attention to span multiple segments without reprocessing past tokens from scratch, greatly extending the effective context window.

Architecture Type

decoder-only — designed for language modeling, where each position can only attend to past context.

Key Design Decisions

  • Segment-level recurrence: Hidden states from the previous segment $\tau$ are cached (with a stop-gradient) and concatenated with the current segment's hidden states before computing keys and values at each layer. Queries use only current-segment hidden states.
  • Relative positional encoding: Because reused hidden states from different segments would otherwise receive identical absolute position encodings, Transformer-XL replaces absolute encodings with a relative scheme (sinusoidal relative positions combined with learnable global biases), whose attention score decomposes into four terms: content-based addressing, content-dependent positional bias, global content bias, and global positional bias; see Positional Encoding and the decomposition written out after this list.
  • Extended attention span: With memory length $m$, segment length $L$, and $N$ layers, the maximum effective temporal range grows to roughly $m \times N$ tokens, while per-segment attention cost is $O(L^2 + Lm)$.
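
For reference, the relative attention score between query position $i$ and key position $j$, in the paper's notation (Dai et al., 2019, Section 3.3), decomposes as below; $E_{x_i}$ is the content embedding, $R_{i-j}$ the sinusoidal relative position encoding, $u$ and $v$ the learnable global biases, and $W_{k,E}$, $W_{k,R}$ the content and position key projections:

$$ A^{\mathrm{rel}}_{i,j} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content-based addressing}} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-dependent positional bias}} + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}} + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global positional bias}} $$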

Recurrence Equations

For the $(\tau+1)$-th segment, the $n$-th layer hidden state $h^{(n)}_{\tau+1} \in \mathbb{R}^{L \times d}$ is computed as:

$$ \tilde{h}^{(n-1)}_{\tau+1} = \left[\text{stop-gradient}\!\left(h^{(n-1)}_{\tau}\right) \circ h^{(n-1)}_{\tau+1}\right] $$

$$ Q^{(n)}_{\tau+1} = h^{(n-1)}_{\tau+1} W^q, \quad K^{(n)}_{\tau+1} = \tilde{h}^{(n-1)}_{\tau+1} W^k, \quad V^{(n)}_{\tau+1} = \tilde{h}^{(n-1)}_{\tau+1} W^v $$

$$ h^{(n)}_{\tau+1} = \text{transformer-layer}\!\left(Q^{(n)}_{\tau+1},\, K^{(n)}_{\tau+1},\, V^{(n)}_{\tau+1}\right) $$

The concatenation $[\cdot \circ \cdot]$ is along the sequence-length dimension.
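
A minimal PyTorch-style sketch of this recurrence for a single layer and a single attention head (hypothetical class name `XLAttentionSketch`; relative position terms, multi-head structure, and output projection omitted), showing the stop-gradient on the cached states, the length-wise concatenation, and queries drawn from the current segment only:

```python
from typing import Optional

import torch
import torch.nn as nn


class XLAttentionSketch(nn.Module):
    """Sketch of Transformer-XL segment-level recurrence for one layer.

    Single head, no relative positional terms, no output projection --
    only the memory handling from the recurrence equations above.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, h: torch.Tensor, mem: Optional[torch.Tensor]):
        # h:   current segment's hidden states h^{(n-1)}_{tau+1}, shape (L, d)
        # mem: cached hidden states h^{(n-1)}_{tau} from the previous segment, shape (m, d)
        if mem is not None:
            # stop-gradient on the cache, then concatenate along the length dimension
            h_tilde = torch.cat([mem.detach(), h], dim=0)    # (m + L, d)
        else:
            h_tilde = h

        q = self.w_q(h)          # queries from the current segment only   (L, d)
        k = self.w_k(h_tilde)    # keys over memory + current segment      (m + L, d)
        v = self.w_v(h_tilde)    # values over memory + current segment    (m + L, d)

        scores = (q @ k.transpose(0, 1)) * self.scale        # (L, m + L)

        # causal mask: position i sees all of memory plus current positions <= i
        L, total = scores.shape
        m = total - L
        causal = torch.ones(L, L).tril().bool()
        mask = torch.cat([torch.ones(L, m).bool(), causal], dim=1)
        scores = scores.masked_fill(~mask, float("-inf"))

        out = torch.softmax(scores, dim=-1) @ v              # (L, d)
        new_mem = h.detach()     # this segment's inputs become the next segment's memory
        return out, new_mem


# usage: process consecutive segments, carrying the memory forward
layer = XLAttentionSketch(d_model=64)
mem = None
for segment in torch.randn(4, 32, 64):       # 4 consecutive segments of length L = 32
    out, mem = layer(segment, mem)
```

In the full model each layer keeps its own memory, the cache is truncated to the last $m$ positions, and the relative-position terms and global biases $u$, $v$ from the decomposition above would enter the score computation here.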

Training Objective

Autoregressive next-token prediction (language modeling), same as a standard decoder-only Transformer.
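
Written out, with the corpus $\mathbf{x} = (x_1, \dots, x_T)$ processed segment by segment, the objective is the standard negative log-likelihood:

$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta\!\left(x_t \mid x_{<t}\right) $$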

Versions / Variants

| Variant | Notes |
| --- | --- |
| Transformer-XL (base) | Segment recurrence + relative positional encoding (Dai et al., 2019) |
| Compressive Transformer | Extends Transformer-XL by adding a compressed memory tier for older activations (Rae et al., 2019); see Compressive Transformer |

Downstream Capabilities

  • Language modeling on long documents
  • Tasks requiring dependencies that exceed a single fixed-length context window
  • Efficient evaluation: the recurrence mechanism avoids redundant recomputation when sliding the context window; a rough cost comparison follows this list
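
A back-of-the-envelope sketch (hypothetical document and window sizes) of why the recurrence helps at evaluation time: a vanilla fixed-length evaluator re-encodes a full window per prediction so that every token sees the same amount of context, while the recurrence encodes each token once and reuses the cached states:

```python
# Back-of-the-envelope comparison (hypothetical sizes) of the number of token
# encodings needed to evaluate a T-token document when every prediction should
# see up to L tokens of context.
T = 100_000          # tokens in the document
L = 512              # fixed context / segment length

# Vanilla fixed-length evaluation: slide the window one token at a time and
# re-encode the whole L-token window for each new prediction.
vanilla_encodings = T * L

# Transformer-XL-style evaluation: advance segment by segment, reusing cached
# hidden states as memory, so each token is encoded once.
xl_encodings = T

print(f"~{vanilla_encodings / xl_encodings:.0f}x fewer token encodings")  # ~512x
```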

Successor Architectures

  • Compressive Transformer — adds lossy compression of old memories on top of the Transformer-XL recurrence scheme.

Key Papers

  • Dai et al. (2019), "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"