

---
type: architecture
title: "Compressive Transformer"
family: "decoder-only"
introduced_in: "Compressive Transformers for Long-Range Sequence Modelling"
tags: [long-context, memory, compression, recurrence]
created: 2023-01-27
---

Compressive Transformer

Summary

Compressive Transformer (Rae et al., 2019) extends Transformer-XL by adding a second, compressed memory tier. When past activations become old enough to be evicted from the primary memory, they are compressed and stored in a compressed memory buffer rather than discarded, extending the effective temporal range further.

Architecture Type

decoder-only — designed for autoregressive language modeling over long sequences.

Key Design Decisions

  • Two memory tiers per layer: A regular FIFO memory of size $m_m$ (as in Transformer-XL) and a compressed memory of size $m_{cm}$, both stored per layer.
  • Compression function $f_c : \mathbb{R}^{L \times d} \to \mathbb{R}^{\lfloor L/c \rfloor \times d}$ maps the $L$ oldest activations to $\lfloor L/c \rfloor$ compressed elements with compression rate $c$. Candidate functions include:
    • Max/mean pooling (kernel and stride $c$)
    • 1D convolution with kernel and stride $c$ (learnable)
    • Dilated convolution (learnable; best empirically on EnWik8)
    • Most-used memory selection
  • Attention ordering: the model attends, oldest to newest, over compressed memory → regular memory → the causally masked current sequence.
  • Temporal range: $(m_m + c \cdot m_{cm}) \times N$ tokens for an $N$-layer model, with attention cost $O(L^2 + L(m_m + m_{cm}))$.
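The two-tier update can be sketched as follows. This is an illustrative, simplified implementation using mean pooling (the parameter-free baseline from the list above) rather than a learnable convolution; the class and method names are hypothetical, not from the paper's code.

```python
import numpy as np

class CompressiveMemory:
    """One layer's two-tier memory: a FIFO buffer of raw activations
    plus a compressed buffer for activations evicted from the FIFO."""

    def __init__(self, d_model, mem_len, cmem_len, c):
        self.mem_len, self.cmem_len, self.c = mem_len, cmem_len, c
        self.mem = np.empty((0, d_model))    # newest raw activations (FIFO)
        self.cmem = np.empty((0, d_model))   # oldest activations, compressed

    def update(self, h):
        """Append new activations h (shape [L, d]); compress any overflow."""
        self.mem = np.concatenate([self.mem, h])
        overflow = self.mem.shape[0] - self.mem_len
        if overflow > 0:
            old, self.mem = self.mem[:overflow], self.mem[overflow:]
            # mean-pool with kernel and stride c: L rows -> floor(L/c) rows
            # (a remainder shorter than c is dropped in this sketch)
            n = (old.shape[0] // self.c) * self.c
            pooled = old[:n].reshape(-1, self.c, old.shape[1]).mean(axis=1)
            self.cmem = np.concatenate([self.cmem, pooled])[-self.cmem_len:]

    def context(self):
        """Memories attended to, ordered oldest -> newest: cmem, then mem."""
        return np.concatenate([self.cmem, self.mem])
```

For the temporal-range formula: with, say, $m_m = m_{cm} = 512$, $c = 3$, and $N = 24$ layers (illustrative numbers), the maximum range is $(512 + 3 \cdot 512) \times 24 = 49{,}152$ tokens.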

Auxiliary Training Losses

Two additional losses train the compression function:

  1. Auto-encoding loss (lossless objective): measures how well original memories can be reconstructed from compressed memories: $$\mathcal{L}_{ac} = \left\lVert \text{old\_mem}^{(i)} - g(\text{new\_cm}^{(i)}) \right\rVert_2$$ where $g$ reverses the compression.

  2. Attention-reconstruction loss (lossy objective): minimises divergence between attention distributions over original vs. compressed memories: $$\mathcal{L}_{ar} = \left\lVert \text{attn}(h^{(i)}, \text{old\_mem}^{(i)}) - \text{attn}(h^{(i)}, \text{new\_cm}^{(i)}) \right\rVert_2$$
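The two auxiliary losses can be sketched as below. This is a simplified stand-in, not the paper's implementation: single-head dot-product attention replaces the model's multi-head content attention, and `g` is any user-supplied decompression function (e.g. a transposed convolution in the learnable case).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def auto_encoding_loss(old_mem, new_cmem, g):
    """Lossless objective: L2 distance between the original memories
    and their reconstruction g(new_cmem), where g reverses compression."""
    return np.linalg.norm(old_mem - g(new_cmem))

def attn_reconstruction_loss(h, old_mem, new_cmem):
    """Lossy objective: L2 distance between attention outputs of the
    current hidden states h over original vs. compressed memories."""
    def attend(q, kv):
        # simplified single-head dot-product attention over memory kv
        w = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
        return w @ kv
    return np.linalg.norm(attend(h, old_mem) - attend(h, new_cmem))
```

A compression function that preserves exactly what the model attends to drives the attention-reconstruction loss to zero even when exact reconstruction (the auto-encoding loss) is impossible, which is why the lossy objective better matches the goal of keeping only salient information.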

Training Objective

Autoregressive next-token prediction with two auxiliary compression losses ($\mathcal{L}_{ac}$ and $\mathcal{L}_{ar}$) added to the main language modeling objective.

Versions / Variants

| Compression function | Notes |
| --- | --- |
| Max/mean pooling | No extra parameters; simple baseline |
| 1D convolution | Learnable; stride = compression rate |
| Dilated convolution | Best reported performance on EnWik8 |
| Most-used selection | Non-parametric; retains salient activations |

Downstream Capabilities

  • Long-document language modeling
  • Tasks requiring retention of salient information from very distant context, beyond what Transformer-XL's flat memory supports

Successor Architectures

None directly, but the two-tier memory idea influenced subsequent external memory work.

Key Papers

  • Rae et al. (2019), "Compressive Transformers for Long-Range Sequence Modelling"