Transformer Architecture

type: architecture
title: "GPT-2"
family: "decoder-only"
introduced_in: "Language Models are Unsupervised Multitask Learners"
tags: [decoder-only, language-modeling, autoregressive, openai]
created: 2025-01-01

GPT-2

Summary

GPT-2 is a large-scale decoder-only Transformer language model developed by OpenAI, trained on 40GB of internet text (WebText). It was designed to demonstrate that a sufficiently large autoregressive language model trained on diverse web data could perform impressively across many tasks without task-specific fine-tuning.

Architecture Type

decoder-only — GPT-2 stacks Transformer decoder blocks (the "decoder-only" variant), discarding both the encoder and the encoder-decoder cross-attention sublayer present in the original Transformer. This choice suits autoregressive next-token prediction, where the model generates one token at a time and must not attend to future positions.
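A minimal sketch of that constraint, assuming single-head attention with illustrative shapes (not GPT-2's actual implementation); the causal mask blocks every position from attending to positions after it:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Q, K, V: (seq_len, d_head) arrays. Returns (seq_len, d_head)."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)                 # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)              # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V
```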

Key Design Decisions

  • Decoder-only blocks: Stacks transformer blocks that use Masked Self-Attention rather than bidirectional Self-Attention, preventing any position from attending to future tokens.
  • No cross-attention: Unlike the full Transformer, there is no second attention sublayer attending to encoder outputs.
  • Context window of 1024 tokens: Double the 512-token context used by its predecessor, GPT-1.
  • Positional Encoding: Uses a learned positional encoding matrix with one vector per position (up to 1024), added to token Embedding vectors before the first block.
  • Feed-Forward Network sizing: The inner feed-forward layer is 4× the model dimension (e.g., 3072 units for GPT-2 small with d_model=768), following the convention established in the original Transformer.
  • Layer Normalization: Moved to the input of each sub-block (pre-norm), with an additional Layer Normalization after the final block, to stabilize training of the deeper stacks.
  • Byte Pair Encoding tokenization: Byte-level BPE with a vocabulary of 50,257 tokens representing sub-word units.
  • Autoregressive inference with KV caching: During generation, key and value vectors from previously processed tokens are cached per layer so they do not need to be recomputed at each decoding step (see the first sketch after this list).
  • Top-k sampling: At generation time, the model can sample from the k highest-probability tokens rather than always taking the argmax, producing more varied output (second sketch after this list).
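A hedged sketch of the KV cache; the function name, cache layout, and weights below are illustrative assumptions, not GPT-2's internals. At each step only the newest token's query, key, and value are computed, and no causal mask is needed because the cache holds only past positions:

```python
import numpy as np

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One generation step. x_new: (1, d_model) embedding of the newest token.
    cache['k'], cache['v']: (t, d_model) keys/values from earlier steps."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    cache["k"] = np.concatenate([cache["k"], k])       # reuse old keys, append new
    cache["v"] = np.concatenate([cache["v"], v])
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])   # (1, t+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over all past tokens
    return weights @ cache["v"]                        # (1, d_model)

# Usage: start with empty caches, then call once per generated token.
d = 8
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
Wq = Wk = Wv = np.eye(d)                               # toy weights
out = decode_step(np.ones((1, d)), cache, Wq, Wk, Wv)
```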
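And a sketch of top-k sampling over a vector of logits; k is a free parameter here (k=40 is the value usually quoted for GPT-2's released samples):

```python
import numpy as np

def top_k_sample(logits, k=40, rng=None):
    """Sample a token id from the k highest-probability entries of logits."""
    rng = rng or np.random.default_rng()
    top = np.argpartition(logits, -k)[-k:]     # indices of the k largest logits
    z = logits[top] - logits[top].max()        # stabilized softmax over top-k
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=p))
```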

Training Objective

Next-token prediction (causal language modeling): the model is trained to predict each token given all preceding tokens in a sequence, a standard autoregressive objective. At training time the model processes sequences of up to 1024 tokens with a batch size of 512.
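A minimal sketch of the shifted-target loss, assuming PyTorch rather than GPT-2's original TensorFlow code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    """logits: (batch, T, vocab) model outputs; tokens: (batch, T) token ids."""
    pred = logits[:, :-1, :]          # position i predicts token i+1
    target = tokens[:, 1:]            # the "next token" at each position
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```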

Versions / Variants

Variant        Parameters                             Layers   d_model   Attention Heads
GPT-2 Small    ~117M (published) / ~124M (counted)    12       768       12
GPT-2 Medium   ~345M                                  24       1024      16
GPT-2 Large    ~762M                                  36       1280      20
GPT-2 XL       ~1542M                                 48       1600      25
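Re-deriving the totals from these shapes reproduces the "counted" figures (124.4M, 354.8M, 774.0M, 1557.6M) and explains why they run slightly above the published numbers. A sketch, assuming the layout of the released GPT-2 weights: tied input/output embeddings, a learned position table, a fused QKV projection with biases, and two LayerNorms per block plus a final one:

```python
def gpt2_params(n_layer, d_model, n_ctx=1024, vocab=50257):
    d_ff = 4 * d_model                                  # the 4x FFN convention
    emb = vocab * d_model + n_ctx * d_model             # token + position tables
    attn = d_model * 3 * d_model + 3 * d_model          # fused QKV projection
    attn += d_model * d_model + d_model                 # attention output projection
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = attn + mlp + 2 * 2 * d_model                # plus two LayerNorms per block
    return emb + n_layer * block + 2 * d_model          # plus the final LayerNorm

for name, (n_layer, d_model) in {"Small": (12, 768), "Medium": (24, 1024),
                                 "Large": (36, 1280), "XL": (48, 1600)}.items():
    print(f"GPT-2 {name}: {gpt2_params(n_layer, d_model) / 1e6:.1f}M")
```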

Downstream Capabilities

  • Text generation: Coherent long-form prose generation, demonstrated via unconditional and conditional (prompted) sampling; interactive applications such as AI Dungeon were later built on GPT-2.
  • Summarization: When prompted with an article body followed by "TL;DR:", GPT-2 generates summary-like continuations (see the usage sketch after this list).
  • Machine translation: Given a prompt of example pairs in the form "english sentence = french sentence", GPT-2 can produce rudimentary zero-shot translations.
  • Zero-shot task transfer: Because the model was trained on diverse web text, it exhibits surprising zero-shot ability on tasks it was not explicitly trained for.
  • Music generation (architectural pattern): The same decoder-only architecture used in GPT-2 was applied to music generation in Music Transformer, representing notes and timing as discrete tokens.
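A usage sketch for the summarization bullet above, assuming the Hugging Face transformers port of GPT-2 rather than OpenAI's original code; the "TL;DR:" suffix is the prompt format the GPT-2 paper used to elicit summaries, with top-k sampling enabled as described earlier:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "Some long news article text ..."        # placeholder input
inputs = tokenizer(article + "\nTL;DR:", return_tensors="pt")
output = model.generate(**inputs,
                        max_new_tokens=60,
                        do_sample=True,            # sample rather than greedy argmax
                        top_k=40)                  # top-k sampling
print(tokenizer.decode(output[0], skip_special_tokens=True))
```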

Successor Architectures

GPT-3 scales the same decoder-only recipe to far larger models (up to 175B parameters), extending GPT-2's zero-shot behavior to few-shot in-context learning.

Key Papers

  • "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019), the paper that introduces GPT-2.
  • "Attention Is All You Need" (Vaswani et al., 2017), the original Transformer from which the decoder-only block is derived.