Compressive Transformer
Compressive Transformer adds a second compressed memory tier to Transformer-XL, using learnable compression functions and auxiliary losses to preserve salient information from distant past activations.
type: architecture
title: "Compressive Transformer"
family: "decoder-only"
introduced_in: "Compressive Transformers for Long-Range Sequence Modelling"
tags: [long-context, memory, compression, recurrence]
created: 2023-01-27
Compressive Transformer
Summary
Compressive Transformer (Rae et al., 2019) extends Transformer-XL by adding a second, compressed memory tier. When past activations become old enough to be evicted from the primary memory, they are compressed and stored in a compressed memory buffer rather than discarded, extending the effective temporal range further.
Architecture Type
decoder-only — designed for autoregressive language modeling over long sequences.
Key Design Decisions
- Two memory tiers per layer: A regular FIFO memory of size $m_m$ (as in Transformer-XL) and a compressed memory of size $m_{cm}$, both stored per layer.
- Compression function $f_c : \mathbb{R}^{L \times d} \to \mathbb{R}^{\lfloor L/c \rfloor \times d}$ maps the $L$ oldest activations to $\lfloor L/c \rfloor$ compressed elements with compression rate $c$. Candidate functions include:
- Max/mean pooling (kernel and stride $c$)
- 1D convolution with kernel and stride $c$ (learnable)
- Dilated convolution (learnable; best empirically on EnWik8)
- Most-used memory selection: a non-parametric alternative that retains the memories receiving the most attention and discards the rest.
- Attention ordering: attention is computed over the concatenation, ordered oldest to newest, of compressed memory → regular memory → the causally masked current sequence.
- Temporal range: $(m_m + c \cdot m_{cm}) \times N$ tokens, where $N$ is the number of layers, with attention cost $O(L^2 + L(m_m + m_{cm}))$ for segment length $L$ (see the memory-update sketch below).
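The following is a minimal PyTorch-style sketch of the per-layer memory update described above. Class and variable names (`CompressiveMemory`, `new_h`, etc.) are illustrative and this is not the authors' released implementation; the compression function shown is the learnable 1D-convolution variant.

```python
import torch
import torch.nn as nn

class CompressiveMemory(nn.Module):
    """Per-layer two-tier memory: FIFO memory of size m_m plus compressed memory of size m_cm (sketch)."""

    def __init__(self, d_model, mem_len, cmem_len, c_rate=3):
        super().__init__()
        self.mem_len, self.cmem_len, self.c = mem_len, cmem_len, c_rate
        # Learnable compression f_c: 1D convolution with kernel = stride = c,
        # so L old activations map to floor(L / c) compressed slots.
        self.compress = nn.Conv1d(d_model, d_model, kernel_size=c_rate, stride=c_rate)

    def update(self, mem, cmem, new_h):
        """mem: [m_m, B, d], cmem: [m_cm, B, d], new_h: [L, B, d] current-segment activations."""
        # 1. Push current activations into the FIFO memory; the overflow is evicted.
        full = torch.cat([mem, new_h], dim=0)
        mem, old = full[-self.mem_len:], full[:-self.mem_len]
        new_cm = None
        if old.size(0) >= self.c:
            # 2. Compress the evicted activations and append them to compressed memory,
            #    which in turn evicts its own oldest entries once full.
            x = old.permute(1, 2, 0)                        # [B, d, L_old] for Conv1d
            new_cm = self.compress(x).permute(2, 0, 1)      # [floor(L_old / c), B, d]
            cmem = torch.cat([cmem, new_cm], dim=0)[-self.cmem_len:]
        # Attention at the next step is computed over [cmem; mem; current segment],
        # ordered oldest to newest. Memories are stored without gradients; new_cm is
        # returned separately so the auxiliary losses can train the compression function.
        return mem.detach(), cmem.detach(), old, new_cm
```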
Auxiliary Training Losses
Two additional losses train the compression function:
- Auto-encoding loss (lossless objective): measures how well the original memories can be reconstructed from the compressed memories: $$\mathcal{L}_{ac} = \lVert \text{old\_mem}^{(i)} - g(\text{new\_cm}^{(i)}) \rVert_2$$ where $g$ is a learned function that reverses the compression.
- Attention-reconstruction loss (lossy objective): minimizes the difference between attention computed over the original memories and attention computed over their compressed counterparts: $$\mathcal{L}_{ar} = \lVert \text{attn}(h^{(i)}, \text{old\_mem}^{(i)}) - \text{attn}(h^{(i)}, \text{new\_cm}^{(i)}) \rVert_2$$ (both losses are sketched in code below).
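Below is a single-head sketch of the two auxiliary losses, matching the formulas above. The attention parametrization, the decoder `g`, and the stop-gradient placement are simplifying assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def attn(h, mem, w_q, w_k, w_v):
    """Single-head content-based attention of queries h over a memory (sketch, no batch dim)."""
    q, k, v = h @ w_q, mem @ w_k, mem @ w_v              # [L, d], [M, d], [M, d]
    weights = F.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
    return weights @ v                                    # [L, d]

def auto_encoding_loss(old_mem, new_cm, g):
    # Lossless objective: a learned decoder g (e.g. a transposed convolution)
    # should reconstruct the original memories from their compressed form.
    return torch.norm(old_mem - g(new_cm), p=2)

def attention_reconstruction_loss(h, old_mem, new_cm, w_q, w_k, w_v):
    # Lossy objective: attending over the compressed memories should give the
    # same result as attending over the original memories they replace.
    # The attention parameters are detached so only the compression fn is trained.
    w_q, w_k, w_v = (w.detach() for w in (w_q, w_k, w_v))
    return torch.norm(attn(h, old_mem, w_q, w_k, w_v)
                      - attn(h, new_cm, w_q, w_k, w_v), p=2)
```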
Training Objective
Autoregressive next-token prediction, with the two auxiliary compression losses ($\mathcal{L}_{ac}$ and $\mathcal{L}_{ar}$) added to the main language modeling objective.
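As a rough illustration of the combined objective (loss weighting and the single joint backward pass are assumptions; in practice the auxiliary losses may be optimized separately from the language-model loss):

```python
# lm_loss: autoregressive cross-entropy; l_ac, l_ar: auxiliary compression losses.
total_loss = lm_loss + l_ac + l_ar
total_loss.backward()
```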
Versions / Variants
| Compression function | Notes |
|---|---|
| Max/mean pooling | No extra parameters; simple baseline |
| 1D convolution | Learnable; stride = compression rate |
| Dilated convolution | Best reported performance on EnWik8 |
| Most-used selection | Non-parametric; retains salient activations |
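For concreteness, a hypothetical factory for the parametric and pooling variants in the table (kernel, stride, and dilation choices are illustrative assumptions; the most-used variant is a selection rule over stored memories rather than a module, so it is omitted here):

```python
import torch.nn as nn

def make_compression_fn(kind: str, d_model: int, c: int) -> nn.Module:
    """Return a module mapping [B, d, L] activations to roughly [B, d, L // c] (sketch)."""
    if kind == "mean_pool":
        return nn.AvgPool1d(kernel_size=c, stride=c)        # parameter-free
    if kind == "max_pool":
        return nn.MaxPool1d(kernel_size=c, stride=c)         # parameter-free
    if kind == "conv":
        return nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)   # learnable
    if kind == "dilated_conv":
        # Dilation widens the receptive field; the exact hyperparameters are assumptions.
        return nn.Conv1d(d_model, d_model, kernel_size=c, stride=c, dilation=2)
    raise ValueError(f"unknown compression function: {kind}")
```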
Downstream Capabilities
- Long-document language modeling
- Tasks requiring retention of salient information from very distant context, beyond what Transformer-XL's flat memory supports
Successor Architectures
None directly, but the two-tier memory idea influenced subsequent external memory work.
Key Papers
- Rae et al. (2019), "Compressive Transformers for Long-Range Sequence Modelling"