Transformer Architecture

type: architecture
title: "GPT-2"
family: "decoder-only"
introduced_in: "Language Models are Unsupervised Multitask Learners"
tags: [decoder-only, language-modeling, autoregressive, openai]
created: 2025-01-01

GPT-2

Summary

GPT-2 is a large-scale decoder-only Transformer language model developed by OpenAI, trained on 40GB of internet text (WebText). It was designed to demonstrate that a sufficiently large autoregressive language model trained on diverse web data could perform impressively across many tasks without task-specific fine-tuning.

Architecture Type

decoder-only — GPT-2 stacks Transformer decoder blocks (the "decoder-only" variant), discarding both the encoder and the encoder-decoder cross-attention sublayer present in the original Transformer. This choice suits autoregressive next-token prediction, where the model generates one token at a time and must not attend to future positions.
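A minimal sketch of that constraint, assuming single-head attention with illustrative shapes (not GPT-2's actual implementation); the causal mask blocks every position from attending to positions after it:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Q, K, V: (seq_len, d_head) arrays. Returns (seq_len, d_head)."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)                 # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)              # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V
```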

Key Design Decisions

  • Decoder-only blocks: Stacks transformer blocks that use Masked Self-Attention rather than bidirectional Self-Attention, preventing any position from attending to future tokens.
  • No cross-attention: Unlike the full Transformer, there is no second attention sublayer attending to encoder outputs.
  • Context window of 1024 tokens: Double the 512-token context used by its predecessor, GPT-1.
  • Positional Encoding: Uses a learned positional encoding matrix with one vector per position (up to 1024), added to token Embedding vectors before the first block.
  • Feed-Forward Network sizing: The inner feed-forward layer is 4× the model dimension (e.g., 3072 units for GPT-2 small with d_model=768), following the convention established in the original Transformer.
  • Layer Normalization: Moved to the input of each sub-block (pre-norm), with an additional Layer Normalization after the final block, to stabilize training of the deeper stacks.
  • Byte Pair Encoding tokenization: Byte-level BPE with a vocabulary of 50,257 tokens representing sub-word units.
  • Autoregressive inference with KV caching: During generation, key and value vectors from previously processed tokens are cached per layer so they do not need to be recomputed at each decoding step (see the first sketch after this list).
  • Top-k sampling: At generation time, the model can sample from the k highest-probability tokens rather than always taking the argmax, producing more varied output (second sketch after this list).
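A hedged sketch of the KV cache; the function name, cache layout, and weights below are illustrative assumptions, not GPT-2's internals. At each step only the newest token's query, key, and value are computed, and no causal mask is needed because the cache holds only past positions:

```python
import numpy as np

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One generation step. x_new: (1, d_model) embedding of the newest token.
    cache['k'], cache['v']: (t, d_model) keys/values from earlier steps."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    cache["k"] = np.concatenate([cache["k"], k])       # reuse old keys, append new
    cache["v"] = np.concatenate([cache["v"], v])
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])   # (1, t+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over all past tokens
    return weights @ cache["v"]                        # (1, d_model)

# Usage: start with empty caches, then call once per generated token.
d = 8
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
Wq = Wk = Wv = np.eye(d)                               # toy weights
out = decode_step(np.ones((1, d)), cache, Wq, Wk, Wv)
```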
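And a sketch of top-k sampling over a vector of logits; k is a free parameter here (k=40 is the value usually quoted for GPT-2's released samples):

```python
import numpy as np

def top_k_sample(logits, k=40, rng=None):
    """Sample a token id from the k highest-probability entries of logits."""
    rng = rng or np.random.default_rng()
    top = np.argpartition(logits, -k)[-k:]     # indices of the k largest logits
    z = logits[top] - logits[top].max()        # stabilized softmax over top-k
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=p))
```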

Training Objective

Next-token prediction (causal language modeling): the model is trained to predict each token given all preceding tokens in a sequence, a standard autoregressive objective. At training time the model processes sequences of up to 1024 tokens with a batch size of 512.
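A minimal sketch of the shifted-target loss, assuming PyTorch rather than GPT-2's original TensorFlow code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    """logits: (batch, T, vocab) model outputs; tokens: (batch, T) token ids."""
    pred = logits[:, :-1, :]          # position i predicts token i+1
    target = tokens[:, 1:]            # the "next token" at each position
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```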

Versions / Variants

Variant        Parameters                             Layers   d_model   Attention Heads
GPT-2 Small    ~117M (published) / ~124M (counted)    12       768       12
GPT-2 Medium   ~345M                                  24       1024      16
GPT-2 Large    ~762M                                  36       1280      20
GPT-2 XL       ~1542M                                 48       1600      25
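Re-deriving the totals from these shapes reproduces the "counted" figures (124.4M, 354.8M, 774.0M, 1557.6M) and explains why they run slightly above the published numbers. A sketch, assuming the layout of the released GPT-2 weights: tied input/output embeddings, a learned position table, a fused QKV projection with biases, and two LayerNorms per block plus a final one:

```python
def gpt2_params(n_layer, d_model, n_ctx=1024, vocab=50257):
    d_ff = 4 * d_model                                  # the 4x FFN convention
    emb = vocab * d_model + n_ctx * d_model             # token + position tables
    attn = d_model * 3 * d_model + 3 * d_model          # fused QKV projection
    attn += d_model * d_model + d_model                 # attention output projection
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = attn + mlp + 2 * 2 * d_model                # plus two LayerNorms per block
    return emb + n_layer * block + 2 * d_model          # plus the final LayerNorm

for name, (n_layer, d_model) in {"Small": (12, 768), "Medium": (24, 1024),
                                 "Large": (36, 1280), "XL": (48, 1600)}.items():
    print(f"GPT-2 {name}: {gpt2_params(n_layer, d_model) / 1e6:.1f}M")
```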

Downstream Capabilities

  • Text generation: Coherent long-form prose generation, demonstrated via unconditional and conditional (prompted) sampling; interactive applications such as AI Dungeon were later built on GPT-2.
  • Summarization: When prompted with an article body followed by "TL;DR:", GPT-2 generates summary-like continuations (see the usage sketch after this list).
  • Machine translation: Given a prompt of example pairs in the form "english sentence = french sentence", GPT-2 can produce rudimentary zero-shot translations.
  • Zero-shot task transfer: Because the model was trained on diverse web text, it exhibits surprising zero-shot ability on tasks it was not explicitly trained for.
  • Music generation (architectural pattern): The same decoder-only architecture used in GPT-2 was applied to music generation in Music Transformer, representing notes and timing as discrete tokens.
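A usage sketch for the summarization bullet above, assuming the Hugging Face transformers port of GPT-2 rather than OpenAI's original code; the "TL;DR:" suffix is the prompt format the GPT-2 paper used to elicit summaries, with top-k sampling enabled as described earlier:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "Some long news article text ..."        # placeholder input
inputs = tokenizer(article + "\nTL;DR:", return_tensors="pt")
output = model.generate(**inputs,
                        max_new_tokens=60,
                        do_sample=True,            # sample rather than greedy argmax
                        top_k=40)                  # top-k sampling
print(tokenizer.decode(output[0], skip_special_tokens=True))
```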

Successor Architectures

GPT-3 scales the same decoder-only recipe to far larger models (up to 175B parameters), extending GPT-2's zero-shot behavior to few-shot in-context learning.

Key Papers

  • "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019), the paper that introduces GPT-2.
  • "Attention Is All You Need" (Vaswani et al., 2017), the original Transformer from which the decoder-only block is derived.