Transformer Architecture

Transformer

The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.


---
type: architecture
title: "Transformer"
family: "encoder-decoder"
introduced_in: "Attention Is All You Need"
tags: ["transformer", "encoder-decoder", "attention", "seq2seq", "foundational"]
created: 2025-01-01
---

Summary

The Transformer is the encoder-decoder architecture introduced by Vaswani et al. (2017) at Google Brain, designed to perform sequence-to-sequence tasks such as machine translation without using recurrence or convolution. It relies entirely on Self-Attention mechanisms, enabling full parallelization during training.

Architecture Type

encoder-decoder — This choice was driven by the target task of machine translation, where an input sequence (source language) must be encoded into a latent representation and then decoded into an output sequence (target language). The clean separation of encoding and decoding stages, connected via cross-attention, became a reusable template for subsequent architectures.

Key Design Decisions

  • Attention-only backbone: Replaces RNNs and CNNs with Self-Attention, eliminating sequential computation bottlenecks and enabling parallelization.
  • Multi-Head Attention: Runs multiple attention computations in parallel, each in a lower-dimensional subspace, then concatenates the results. The base model uses 8 heads with $d_k = d_v = 64$ per head.
  • Scaled dot-product attention: Divides attention scores by $\sqrt{d_k}$ so that dot products, whose magnitude grows with dimension, do not push the softmax into saturated regions with vanishing gradients.
  • Positional Encoding: Adds sinusoidal position vectors to input embeddings since the architecture has no inherent notion of sequence order.
  • Layer Normalization: Applied after each sub-layer (post-norm in the original paper), stabilizing training of the deep stack.
  • Residual connections: Wrap every sub-layer, allowing gradients to flow directly through the stack.
  • Feed-Forward Network: A two-layer MLP with a ReLU activation, applied position-wise (identically and independently to each position). Inner dimension is 2048 in the base model.
  • Decoder masking: The decoder's self-attention is causally masked (future positions set to $-\infty$ before softmax) to preserve the autoregressive property during training.
  • Encoder-Decoder Attention: A cross-attention layer in each decoder block where queries come from the decoder and keys/values come from the encoder output.
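The attention-related bullets above can be condensed into a short NumPy sketch. This is a minimal illustration of scaled dot-product attention with the optional causal mask, not the paper's reference implementation; the function name and the single-head, unbatched shapes are our simplifications:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (T_q, T_k) similarity scores
    if causal:
        # Mask future positions with -inf before the softmax, preserving
        # the autoregressive property during training.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key axis; exp(-inf) = 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

The same function expresses cross-attention: pass the decoder states as `Q` and the encoder output as both `K` and `V` (no causal mask), yielding one context vector per target position.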
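The sinusoidal positional encoding from the list can likewise be sketched directly from its definition, $PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$. A minimal version, assuming an even $d_{\text{model}}$ (the function name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) table of sinusoidal position vectors."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2) even indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe
```

These vectors are added to the input embeddings; each dimension is a sinusoid of a different wavelength, so relative offsets correspond to fixed linear transformations of the encoding.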

Training Objective

The original Transformer was trained on machine translation using a standard cross-entropy loss against target token sequences, with label smoothing (value 0.1). Training used the Adam optimizer with a custom learning-rate schedule: a linear warmup over the first 4,000 steps followed by inverse-square-root decay.

Versions / Variants

| Variant | Layers (enc/dec) | $d_{\text{model}}$ | Heads | $d_{ff}$ | Parameters |
| --- | --- | --- | --- | --- | --- |
| Transformer (base) | 6 / 6 | 512 | 8 | 2048 | ~65M |
| Transformer (big) | 6 / 6 | 1024 | 16 | 4096 | ~213M |

Downstream Capabilities

  • Machine translation (original task)
  • Text summarization
  • Question answering (via derived architectures)
  • Foundation for all major LLM families through architectural descendants

Successor Architectures

  • BERT (encoder-only variant, bidirectional pre-training)
  • GPT (decoder-only variant, autoregressive language modeling)
  • T5 (encoder-decoder variant, unified text-to-text framework)

Key Papers