Transformer Architecture

Universal Transformer

Universal Transformer applies a single weight-shared Transformer block recurrently across all positions for a variable number of steps, controlled by per-token adaptive halting, combining global attention with RNN-like inductive bias.


---
type: architecture
title: "Universal Transformer"
family: "encoder-decoder"
introduced_in: "Universal Transformers"
tags: [recurrence, adaptive-computation, transformer-variant]
created: 2023-01-27
---


Summary

Universal Transformer (Dehghani et al., 2019) combines the global receptive field of the Transformer with RNN-style recurrence by repeatedly applying a single shared Transformer block to all positions for a variable number of steps, controlled by an Adaptive Computation Time (ACT) halting mechanism.

Architecture Type

encoder-decoder — the encoder and decoder share the same basic recurrent structure; the decoder additionally attends to the final encoder representation $h^T$.

Key Design Decisions

  • Parameter sharing across depth: Unlike a standard Transformer where each layer has independent weights, Universal Transformer uses the same weights for every recurrent step. A fixed-step Universal Transformer is therefore equivalent to a multi-layer Transformer with tied layer weights.
  • Adaptive Computation Time (ACT): Each token position is equipped with a dynamic halting mechanism. Once a position's recurrent block halts, it stops updating and copies its current value forward until all positions halt or a step limit is reached.
  • 2D positional encoding: To distinguish both sequence position and recurrent step, the positional encoding adds both a position signal and a step signal:

$$
\text{PE}(i, t, \delta) = \begin{cases}
\sin\!\left(\dfrac{i}{10000^{2\delta'/d}}\right) \oplus \sin\!\left(\dfrac{t}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' \\[6pt]
\cos\!\left(\dfrac{i}{10000^{2\delta'/d}}\right) \oplus \cos\!\left(\dfrac{t}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta'+1
\end{cases}
$$

where $i$ is the sequence position, $t$ the recurrent step, and $\delta$ the embedding-dimension index.

  • Recurrent update equations (step $t$, all positions in parallel):

$$ A^t = \text{LayerNorm}\big(h^{t-1} + \text{MultiHeadAttention}(h^{t-1} + P^t)\big) $$
$$ h^t = \text{LayerNorm}\big(A^t + \text{Transition}(A^t)\big) $$

where $\text{Transition}(\cdot)$ is either a separable convolution or a two-layer position-wise feed-forward network with ReLU.
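The recurrent update can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: attention is single-head with no projection matrices, the transition is the feed-forward variant, the position/step signals are simply added (one reading of $\oplus$ above), and all sizes are hypothetical.

```python
import numpy as np

d, seq_len = 8, 5  # hypothetical model width and sequence length

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def pos_step_encoding(seq_len, t, d):
    """2D 'coordinate' encoding P^t: position signal plus recurrent-step signal."""
    i = np.arange(seq_len)[:, None]
    freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(i * freq) + np.sin(t * freq)
    pe[:, 1::2] = np.cos(i * freq) + np.cos(t * freq)
    return pe

def self_attention(x):
    """Single-head scaled dot-product attention (projections omitted for brevity)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

def transition(x, W1, b1, W2, b2):
    """Two-layer position-wise feed-forward network with ReLU."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)) * 0.1, np.zeros(d)

h = rng.normal(size=(seq_len, d))
for t in range(1, 4):  # several recurrent steps, SAME weights every step
    x = h + pos_step_encoding(seq_len, t, d)
    a = layer_norm(h + self_attention(x))           # A^t
    h = layer_norm(a + transition(a, W1, b1, W2, b2))  # h^t
```

Note that `W1`…`b2` are created once and reused at every step, which is exactly the parameter sharing across depth that distinguishes this from a standard Transformer stack.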

Training Objective

Sequence-to-sequence (e.g., machine translation, question answering) using standard cross-entropy loss plus the ACT time penalty that encourages halting early.
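The ACT control flow and its time penalty can be sketched as follows. This is a simplified, copy-forward variant as described above (full ACT maintains a weighted mean of intermediate states); `step_fn`, `halt_fn`, and the toy closures at the bottom are hypothetical stand-ins for the shared block and the sigmoid halting unit.

```python
import numpy as np

def act_halting(h0, step_fn, halt_fn, max_steps=10, eps=0.01):
    """Per-position ACT: update each position until its cumulative halting
    probability exceeds 1 - eps, then freeze it (copy its state forward)."""
    h = h0.copy()
    n = h.shape[0]
    cum_p = np.zeros(n)                    # accumulated halting probability
    remainders = np.zeros(n)
    n_updates = np.zeros(n)                # steps actually taken per position
    still_running = np.ones(n, dtype=bool)
    for t in range(max_steps):
        p = halt_fn(h)                     # per-position halting probability
        new_halted = still_running & (cum_p + p > 1 - eps)
        still_running &= ~new_halted
        remainders[new_halted] = 1 - cum_p[new_halted]
        cum_p += np.where(still_running, p, 0.0)
        n_updates += still_running | new_halted
        mask = (still_running | new_halted)[:, None]
        h = np.where(mask, step_fn(h, t), h)  # halted positions copy forward
        if not still_running.any():
            break
    # The ACT "ponder cost" added to the loss: total_loss = CE + tau * ponder_cost
    ponder_cost = (n_updates + remainders).mean()
    return h, ponder_cost

rng = np.random.default_rng(0)
w = rng.normal(size=4) * 0.1
h0 = rng.normal(size=(3, 4))
step = lambda h, t: 0.9 * h                      # toy stand-in for the shared block
halt = lambda h: 1 / (1 + np.exp(-(h @ w)))      # sigmoid halting unit
h, cost = act_halting(h0, step, halt)
```

Minimizing `ponder_cost` (weighted by a hyperparameter `tau`) pushes positions to halt in as few steps as the task allows.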

Versions / Variants

| Variant | Notes |
| --- | --- |
| Fixed-step Universal Transformer | Equivalent to a weight-tied multi-layer Transformer |
| Adaptive Universal Transformer | Step count varies per token via ACT |

Downstream Capabilities

  • Tasks requiring variable depth of reasoning per token (e.g., algorithmic tasks, question answering)
  • Sequence-to-sequence tasks where different parts of the input require different amounts of processing

Successor Architectures

No direct successor, but the weight-tying and adaptive-depth ideas influenced later work on depth-adaptive inference in the efficient-Transformer literature.

Key Papers

  • Dehghani et al. (2019), "Universal Transformers"