Universal Transformer
Universal Transformer applies a single weight-shared Transformer block recurrently across all positions for a variable number of steps, controlled by per-token adaptive halting, combining global attention with RNN-like inductive bias.
type: architecture
title: "Universal Transformer"
family: "encoder-decoder"
introduced_in: "Universal Transformers"
tags: [recurrence, adaptive-computation, transformer-variant]
created: 2023-01-27
Summary
Universal Transformer (Dehghani et al., 2019) combines the global receptive field of the Transformer with RNN-style recurrence by repeatedly applying a single shared Transformer block to all positions for a variable number of steps, controlled by an Adaptive Computation Time (ACT) halting mechanism.
Architecture Type
encoder-decoder — the encoder and decoder share the same basic recurrent structure; the decoder additionally attends to the final encoder representation $h^T$.
Key Design Decisions
- Parameter sharing across depth: Unlike a standard Transformer where each layer has independent weights, Universal Transformer uses the same weights for every recurrent step. A fixed-step Universal Transformer is therefore equivalent to a multi-layer Transformer with tied layer weights.
- Adaptive Computation Time (ACT): Each token position is equipped with a dynamic halting mechanism. Once a position halts, its state is no longer updated and is simply copied forward to subsequent steps, until all positions have halted or a maximum number of steps is reached.
- 2D positional encoding: To distinguish both sequence position and recurrent step, the positional encoding adds both a position signal and a step signal:
$$ \text{PE}(i, t, \delta) = \begin{cases} \sin\!\left(\frac{i}{10000^{2\delta'/d}}\right) \oplus \sin\!\left(\frac{t}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' \\ \cos\!\left(\frac{i}{10000^{2\delta'/d}}\right) \oplus \cos\!\left(\frac{t}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta'+1 \end{cases} $$
- Recurrent update equations (step $t$, all positions in parallel):
$$ A^t = \text{LayerNorm}\left(h^{t-1} + \text{MultiHeadAttention}(h^{t-1} + P^t)\right) $$
$$ h^t = \text{LayerNorm}\left(A^t + \text{Transition}(A^t)\right) $$
where $\text{Transition}(\cdot)$ is either a separable convolution or a two-layer position-wise feed-forward network with ReLU; a minimal code sketch of this update follows the list.
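Pulling the bullets above together, here is a minimal sketch (PyTorch) of the fixed-step encoder: the 2D coordinate signal and the recurrent update applied for a fixed number of steps with one weight-shared block. All names (`coordinate_encoding`, `UTEncoderStep`, `universal_encoder`) and hyperparameters are illustrative assumptions, not the paper's reference code; dropout, padding masks, and ACT halting are omitted, and an even `d_model` is assumed.

```python
import math
import torch
import torch.nn as nn


def coordinate_encoding(seq_len: int, step: int, d_model: int) -> torch.Tensor:
    """2D coordinate signal P^t: a sinusoid over position i plus one over step t."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                         # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div) + torch.sin(step * div)
    pe[:, 1::2] = torch.cos(positions * div) + torch.cos(step * div)
    return pe


class UTEncoderStep(nn.Module):
    """The single weight-shared block: self-attention + transition, each with residual + LayerNorm."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.transition = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, step: int) -> torch.Tensor:
        # h: (batch, seq, d_model)
        p = coordinate_encoding(h.size(1), step, h.size(2)).to(h.device)
        x = h + p                                 # inject position + step signal
        a = self.ln1(h + self.attn(x, x, x)[0])   # A^t
        return self.ln2(a + self.transition(a))   # h^t


def universal_encoder(h0: torch.Tensor, block: UTEncoderStep, num_steps: int) -> torch.Tensor:
    """Fixed-step variant: apply the same block recurrently for num_steps steps."""
    h = h0
    for t in range(1, num_steps + 1):
        h = block(h, t)
    return h
```

Run for a fixed `num_steps`, this is exactly the weight-tied multi-layer Transformer described in the first bullet; the adaptive variant replaces the fixed loop with per-position halting, sketched under Training Objective below.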
Training Objective
Sequence-to-sequence training (e.g., machine translation, question answering) with the standard cross-entropy loss plus ACT's ponder-time penalty, which encourages positions to halt early.
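To make the halting mechanism and its penalty concrete, here is a small sketch (PyTorch) of the simplified per-position halting described above: each position accumulates a halting probability from a learned sigmoid unit, and once it crosses the threshold its state is copied forward unchanged. The halting unit, threshold, and the `ponder_cost` definition (average number of update steps) are illustrative assumptions; the paper follows Graves' ACT formulation, which additionally tracks per-position remainders.

```python
import torch
import torch.nn as nn


def act_encoder(h: torch.Tensor, block, halting_unit: nn.Linear,
                max_steps: int = 8, threshold: float = 0.99):
    """h: (batch, seq, d_model); block is the shared step (e.g. UTEncoderStep above);
    halting_unit is an assumed nn.Linear(d_model, 1). Returns final states and a ponder cost."""
    batch, seq, _ = h.shape
    cum_halt = torch.zeros(batch, seq, device=h.device)    # accumulated halting probability
    n_updates = torch.zeros(batch, seq, device=h.device)   # update steps taken per position
    for t in range(1, max_steps + 1):
        still_running = (cum_halt < threshold).float()      # 1 where a position has not halted
        p = torch.sigmoid(halting_unit(h)).squeeze(-1)      # per-position halting probability
        cum_halt = cum_halt + p * still_running
        n_updates = n_updates + still_running
        h_new = block(h, t)                                  # shared Transformer block
        # update only positions that are still running; halted states are copied forward
        h = torch.where(still_running.unsqueeze(-1).bool(), h_new, h)
        if bool((cum_halt >= threshold).all()):
            break
    ponder_cost = n_updates.mean()  # ACT time penalty: fewer update steps -> smaller loss term
    return h, ponder_cost
```

The total training loss would then be, e.g., `loss = cross_entropy + tau * ponder_cost` for a small weight `tau` (a hyperparameter name assumed here for illustration).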
Versions / Variants
| Variant | Notes |
|---|---|
| Fixed-step Universal Transformer | Equivalent to weight-tied multi-layer Transformer |
| Adaptive Universal Transformer | Step count varies per token via ACT |
Downstream Capabilities
- Tasks requiring variable depth of reasoning per token (e.g., algorithmic tasks, question answering)
- Sequence-to-sequence tasks where different parts of the input require different amounts of processing
Successor Architectures
None directly, but its weight-tying and adaptive-depth ideas influenced later work on depth-adaptive and early-exit inference in the efficient-Transformer literature.
Key Papers
- Dehghani et al. (2019), "Universal Transformers"