Transformer
The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.
type: architecture
title: "Transformer"
family: "encoder-decoder"
introduced_in: "Attention Is All You Need"
tags: ["transformer", "encoder-decoder", "attention", "seq2seq", "foundational"]
created: 2025-01-01
Summary
The Transformer is the encoder-decoder architecture introduced by Vaswani et al. (2017) at Google Brain, designed to perform sequence-to-sequence tasks such as machine translation without using recurrence or convolution. It relies entirely on Self-Attention mechanisms, enabling full parallelization during training.
Architecture Type
encoder-decoder — This choice was driven by the target task of machine translation, where an input sequence (source language) must be encoded into a latent representation and then decoded into an output sequence (target language). The clean separation of encoding and decoding stages, connected via cross-attention, became a reusable template for subsequent architectures.
Key Design Decisions
- Attention-only backbone: Replaces RNNs and CNNs with Self-Attention, eliminating sequential computation bottlenecks and enabling parallelization.
- Multi-Head Attention: Runs multiple attention computations in parallel, each in a lower-dimensional subspace, then concatenates results. Uses 8 heads in the original model.
- Scaled dot-product attention: Divides the query-key dot products by $\sqrt{d_k}$ so that large magnitudes do not push the softmax into regions with vanishingly small gradients (see the attention sketch after this list).
- Positional Encoding: Adds sinusoidal position vectors to input embeddings since the architecture has no inherent notion of sequence order (see the positional-encoding sketch after this list).
- Layer Normalization: Applied after each sub-layer (post-norm in the original paper), stabilizing training of the deep stack.
- Residual connections: Wrap every sub-layer, allowing gradients to flow directly through the stack.
- Feed-Forward Network: A two-layer MLP with a ReLU activation, applied position-wise (identically and independently to each position). Inner dimension is 2048 in the base model (see the sub-layer sketch after this list).
- Decoder masking: The decoder's self-attention is causally masked (future positions set to $-\infty$ before softmax) to preserve the autoregressive property during training.
- Encoder-Decoder Attention: A cross-attention layer in each decoder block where queries come from the decoder and keys/values come from the encoder output.
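The attention mechanism itself fits in a few lines of NumPy. The sketch below is illustrative rather than a faithful reimplementation: the function names, weight matrices (`w_q`, `w_k`, `w_v`, `w_o`), and the $-10^9$ masking constant are assumptions, but the computation follows the scaled dot-product formulation with an optional causal mask and a multi-head wrapper that splits $d_{\text{model}}$ into `num_heads` subspaces.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k: (..., seq, d_k), v: (..., seq, d_v)
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)        # (..., seq_q, seq_k)
    if causal:
        seq_q, seq_k = scores.shape[-2], scores.shape[-1]
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), 1)
        scores = np.where(mask, -1e9, scores)             # block attention to future positions
    weights = softmax(scores, axis=-1)
    return weights @ v

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o, num_heads, causal=False):
    # x_q: (seq_q, d_model), x_kv: (seq_k, d_model); each w_*: (d_model, d_model)
    d_model = x_q.shape[-1]
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return t.reshape(t.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x_q @ w_q)
    k = split_heads(x_kv @ w_k)
    v = split_heads(x_kv @ w_v)
    out = scaled_dot_product_attention(q, k, v, causal=causal)    # (num_heads, seq_q, d_head)
    out = out.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)   # concatenate heads
    return out @ w_o
```

In this sketch, decoder self-attention corresponds to `causal=True` with `x_q` and `x_kv` both set to the decoder states, while the encoder-decoder attention sub-layer would pass decoder states as `x_q` and the encoder output as `x_kv`.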
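The sinusoidal encoding has a closed form, $PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$ and $PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$, which the following sketch computes (the function name and shapes are illustrative, and an even $d_{\text{model}}$ is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe                      # added element-wise to the token embeddings
```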
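The residual connection, post-norm layer normalization, and position-wise feed-forward network combine into a single sub-layer pattern. A minimal sketch under assumed weight shapes and epsilon value:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def position_wise_ffn(x, w1, b1, w2, b2):
    # Applied identically to every position: (seq, d_model) -> (seq, d_model),
    # with an inner dimension of d_ff (2048 in the base model) and a ReLU in between.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def post_norm_sublayer(x, sublayer_fn, gamma, beta):
    # Original (post-norm) formulation: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer_fn(x), gamma, beta)
```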
Training Objective
The original Transformer was trained on machine translation using a standard cross-entropy loss against target token sequences, with label smoothing (value 0.1). Training used the Adam optimizer with a custom learning rate schedule featuring a linear warmup followed by inverse square root decay.
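A sketch of that schedule, consistent with the description above (the paper uses 4,000 warmup steps; the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for `warmup_steps`, then decay proportional to the inverse
    # square root of the step number, scaled by d_model ** -0.5.
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```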
Versions / Variants
| Variant | Layers (enc/dec) | $d_{\text{model}}$ | Heads | $d_{\text{ff}}$ | Parameters |
|---|---|---|---|---|---|
| Transformer (base) | 6 / 6 | 512 | 8 | 2048 | ~65M |
| Transformer (big) | 6 / 6 | 1024 | 16 | 4096 | ~213M |
Downstream Capabilities
- Machine translation (original task)
- Text summarization
- Question answering (via derived architectures)
- Foundation for all major LLM families through architectural descendants
Successor Architectures
- BERT (encoder-only variant, bidirectional pre-training)
- GPT (decoder-only variant, autoregressive language modeling)
- T5 (encoder-decoder variant, unified text-to-text framework)