Transformer
The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.
type: architecture
title: "Transformer"
family: "encoder-decoder"
introduced_in: "Attention Is All You Need"
tags: ["transformer", "encoder-decoder", "attention", "seq2seq", "foundational"]
created: 2025-01-01
Summary
The Transformer is the encoder-decoder architecture introduced by Vaswani et al. (2017) at Google Brain, designed to perform sequence-to-sequence tasks such as machine translation without using recurrence or convolution. It relies entirely on Self-Attention mechanisms, enabling full parallelization during training.
Architecture Type
encoder-decoder — This choice was driven by the target task of machine translation, where an input sequence (source language) must be encoded into a latent representation and then decoded into an output sequence (target language). The clean separation of encoding and decoding stages, connected via cross-attention, became a reusable template for subsequent architectures.
Key Design Decisions
- Attention-only backbone: Replaces RNNs and CNNs with Self-Attention, eliminating sequential computation bottlenecks and enabling parallelization.
- Multi-Head Attention: Runs multiple attention computations in parallel, each in a lower-dimensional subspace, then concatenates results. Uses 8 heads in the original model.
- Scaled dot-product attention: Divides the query-key dot products by $\sqrt{d_k}$ so that large magnitudes do not push the softmax into regions with vanishingly small gradients (see the attention sketch after this list).
- Positional Encoding: Adds sinusoidal position vectors to input embeddings since the architecture has no inherent notion of sequence order (see the positional-encoding sketch after this list).
- Layer Normalization: Applied after each sub-layer (post-norm in the original paper), stabilizing training of the deep stack.
- Residual connections: Wrap every sub-layer, allowing gradients to flow directly through the stack.
- Feed-Forward Network: A two-layer MLP with a ReLU activation, applied position-wise (identically and independently to each position). Inner dimension is 2048 in the base model (see the sub-layer sketch after this list).
- Decoder masking: The decoder's self-attention is causally masked (future positions set to $-\infty$ before softmax) to preserve the autoregressive property during training.
- Encoder-Decoder Attention: A cross-attention layer in each decoder block where queries come from the decoder and keys/values come from the encoder output.
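The attention mechanism itself fits in a few lines of NumPy. The sketch below is illustrative rather than a faithful reimplementation: the function names, weight matrices (`w_q`, `w_k`, `w_v`, `w_o`), and the $-10^9$ masking constant are assumptions, but the computation follows the scaled dot-product formulation with an optional causal mask and a multi-head wrapper that splits $d_{\text{model}}$ into `num_heads` subspaces.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k: (..., seq, d_k), v: (..., seq, d_v)
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)        # (..., seq_q, seq_k)
    if causal:
        seq_q, seq_k = scores.shape[-2], scores.shape[-1]
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), 1)
        scores = np.where(mask, -1e9, scores)             # block attention to future positions
    weights = softmax(scores, axis=-1)
    return weights @ v

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o, num_heads, causal=False):
    # x_q: (seq_q, d_model), x_kv: (seq_k, d_model); each w_*: (d_model, d_model)
    d_model = x_q.shape[-1]
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return t.reshape(t.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x_q @ w_q)
    k = split_heads(x_kv @ w_k)
    v = split_heads(x_kv @ w_v)
    out = scaled_dot_product_attention(q, k, v, causal=causal)    # (num_heads, seq_q, d_head)
    out = out.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)   # concatenate heads
    return out @ w_o
```

In this sketch, decoder self-attention corresponds to `causal=True` with `x_q` and `x_kv` both set to the decoder states, while the encoder-decoder attention sub-layer would pass decoder states as `x_q` and the encoder output as `x_kv`.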
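The sinusoidal encoding has a closed form, $PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$ and $PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$, which the following sketch computes (the function name and shapes are illustrative, and an even $d_{\text{model}}$ is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe                      # added element-wise to the token embeddings
```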
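The residual connection, post-norm layer normalization, and position-wise feed-forward network combine into a single sub-layer pattern. A minimal sketch under assumed weight shapes and epsilon value:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def position_wise_ffn(x, w1, b1, w2, b2):
    # Applied identically to every position: (seq, d_model) -> (seq, d_model),
    # with an inner dimension of d_ff (2048 in the base model) and a ReLU in between.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def post_norm_sublayer(x, sublayer_fn, gamma, beta):
    # Original (post-norm) formulation: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer_fn(x), gamma, beta)
```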
Training Objective
The original Transformer was trained on machine translation using a standard cross-entropy loss against target token sequences, with label smoothing (value 0.1). Training used the Adam optimizer with a custom learning rate schedule featuring a linear warmup followed by inverse square root decay.
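A sketch of that schedule, consistent with the description above (the paper uses 4,000 warmup steps; the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for `warmup_steps`, then decay proportional to the inverse
    # square root of the step number, scaled by d_model ** -0.5.
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```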
Versions / Variants
| Variant | Layers (enc/dec) | $d_{\text{model}}$ | Heads | $d_{\text{ff}}$ | Parameters |
|---|---|---|---|---|---|
| Transformer (base) | 6 / 6 | 512 | 8 | 2048 | ~65M |
| Transformer (big) | 6 / 6 | 1024 | 16 | 4096 | ~213M |
Downstream Capabilities
- Machine translation (original task)
- Text summarization
- Question answering (via derived architectures)
- Foundation for all major LLM families through architectural descendants
Successor Architectures
- BERT (encoder-only variant, bidirectional pre-training)
- GPT (decoder-only variant, autoregressive language modeling)
- T5 (encoder-decoder variant, unified text-to-text framework)