GPT-2
GPT-2 is OpenAI's large-scale decoder-only Transformer trained on 40GB of web text for autoregressive language modeling, notable for its coherent long-form text generation and zero-shot task transfer.
---
type: architecture
title: "GPT-2"
family: "decoder-only"
introduced_in: "Language Models are Unsupervised Multitask Learners"
tags: [decoder-only, language-modeling, autoregressive, openai]
created: 2025-01-01
---
GPT-2
Summary
GPT-2 is a large-scale decoder-only Transformer language model developed by OpenAI, trained on 40GB of internet text (WebText). It was designed to demonstrate that a sufficiently large autoregressive language model trained on diverse web data could perform impressively across many tasks without task-specific fine-tuning.
Architecture Type
decoder-only — GPT-2 uses a stack of Transformer decoder blocks (the "decoder-only" variant), discarding both the encoder and the encoder-decoder cross-attention sublayer present in the original Transformer. This choice suits autoregressive next-token prediction, where the model generates one token at a time and must not attend to future positions.
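A minimal sketch of one such block in PyTorch makes the layout concrete. The hyperparameters follow GPT-2 Small, but the module and variable names are illustrative rather than OpenAI's code:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-2-style block: pre-LN, masked self-attention, then a 4x-wide MLP."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # inner FFN is 4x the model dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position i may attend only to positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                       # residual connection around attention
        x = x + self.mlp(self.ln2(x))          # residual connection around the MLP
        return x

# Token embeddings plus learned positional embeddings feed the first block.
vocab_size, n_ctx, d_model = 50_257, 1024, 768
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(n_ctx, d_model)         # one learned vector per position

tokens = torch.randint(0, vocab_size, (1, 16))           # a dummy 16-token sequence
positions = torch.arange(tokens.size(1)).unsqueeze(0)
x = tok_emb(tokens) + pos_emb(positions)
print(DecoderBlock()(x).shape)                 # torch.Size([1, 16, 768])
```

GPT-2 Small stacks 12 of these blocks and projects the final hidden states back onto the vocabulary with the (tied) token embedding matrix.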
Key Design Decisions
- Decoder-only blocks: Stacks transformer blocks that use Masked Self-Attention rather than bidirectional Self-Attention, preventing any position from attending to future tokens.
- No cross-attention: Unlike the full Transformer, there is no second attention sublayer attending to encoder outputs.
- Context window of 1024 tokens: Double the 512-token context used by earlier models such as the original GPT.
- Positional Encoding: Uses a learned positional encoding matrix with one vector per position (up to 1024), added to token Embedding vectors before the first block.
- Feed-Forward Network sizing: The inner feed-forward layer is 4× the model dimension (e.g., 3072 units for GPT-2 small with d_model=768), following the convention established in the original Transformer.
- Layer Normalization: Moved to the input of each sub-block (pre-LN), with an additional layer normalization after the final block, to stabilize training.
- Byte Pair Encoding tokenization: Byte-level BPE vocabulary of 50,257 tokens representing sub-word units (a short tokenizer sketch follows this list).
- Autoregressive inference with KV caching: During generation, key and value vectors from previously processed tokens are cached per layer so they do not need to be recomputed at each new token step (see the cache-and-sampling sketch after this list).
- Top-k sampling: At generation time, the model samples from the k highest-probability tokens rather than always taking the argmax, producing more varied output.
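A quick way to see the byte-level BPE vocabulary in action is the Hugging Face `transformers` reimplementation (an external library, not the original OpenAI release; the sample sentence is invented):

```python
from transformers import GPT2Tokenizer  # pip install transformers

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("GPT-2 uses byte pair encoding.")
print(ids)                                   # sub-word token ids
print(tokenizer.convert_ids_to_tokens(ids))  # the BPE pieces; a leading "Ġ" marks a space
print(tokenizer.vocab_size)                  # 50257
```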
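The two inference-time items above, KV caching and top-k sampling, can be sketched together with a toy single-head attention layer. The weights below are random and the names are invented for illustration, so this shows the mechanism rather than GPT-2 itself:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 100, 32                # toy sizes, not GPT-2's
emb = torch.randn(vocab_size, d_model)
W_q = torch.randn(d_model, d_model) / d_model ** 0.5
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5
W_out = torch.randn(d_model, vocab_size) / d_model ** 0.5

def step(token_id, cache):
    """Process ONE new token, reusing cached keys/values from earlier steps."""
    x = emb[token_id]                        # (d_model,)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cache["k"].append(k)                     # keys/values are appended, never recomputed
    cache["v"].append(v)
    K = torch.stack(cache["k"])              # (t, d_model): all positions so far
    V = torch.stack(cache["v"])
    attn = F.softmax(q @ K.T / d_model ** 0.5, dim=-1)   # attends only to the past
    return (attn @ V) @ W_out                # logits over the vocabulary

def sample_top_k(logits, k=40):
    """Keep the k highest-probability tokens and sample among them."""
    top_vals, top_idx = torch.topk(logits, k)
    probs = F.softmax(top_vals, dim=-1)
    return top_idx[torch.multinomial(probs, 1)].item()

cache = {"k": [], "v": []}
token = 1                                    # arbitrary start token
generated = [token]
for _ in range(10):
    logits = step(token, cache)
    token = sample_top_k(logits, k=5)
    generated.append(token)
print(generated)
```

In GPT-2 itself the cache holds keys and values for every layer and head, but the bookkeeping is the same: only the newest token's query is computed at each step.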
Training Objective
Next-token prediction (causal language modeling): the model is trained to predict the next token given all preceding tokens in a sequence. This is a standard autoregressive objective. At training time, the model processes sequences of up to 1024 tokens in batches of 512.
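In code, the objective amounts to shifting the token sequence by one position and applying cross-entropy. A minimal sketch with a stand-in for the model's logits (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

# A batch of token ids, shape (batch, seq_len), with seq_len up to 1024 in GPT-2.
tokens = torch.randint(0, 50_257, (2, 12))

# The model (omitted here) would map tokens to logits of shape (batch, seq_len, vocab).
logits = torch.randn(2, 12, 50_257, requires_grad=True)  # stand-in for model output

# Next-token prediction: position t is trained to predict the token at position t+1.
pred_logits = logits[:, :-1, :]              # predictions for positions 0..T-2
targets = tokens[:, 1:]                      # the tokens that actually came next

loss = F.cross_entropy(
    pred_logits.reshape(-1, 50_257),         # flatten batch and time
    targets.reshape(-1),
)
loss.backward()                              # gradients flow into the (stand-in) logits
print(loss.item())
```

The same objective is applied across the full 40GB WebText corpus; no task-specific labels are involved.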
Versions / Variants
| Variant | Parameters | Layers | d_model | Attention Heads |
|---|---|---|---|---|
| GPT-2 Small | ~117M (published) / ~124M (counted) | 12 | 768 | 12 |
| GPT-2 Medium | ~345M | 24 | 1024 | 16 |
| GPT-2 Large | ~762M | 36 | 1280 | 20 |
| GPT-2 XL | ~1542M | 48 | 1600 | 25 |
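The gap between the published 117M figure and the ~124M obtained by counting can be checked with back-of-the-envelope arithmetic. The count below assumes the standard GPT-2 Small layout (tied input/output embedding, fused QKV projection, biases throughout):

```python
vocab, n_ctx, d, layers = 50_257, 1024, 768, 12

token_embedding    = vocab * d          # ~38.6M (also reused as the output head)
position_embedding = n_ctx * d          # ~0.8M

per_layer = (
    d * 3 * d + 3 * d      # fused Q, K, V projection + bias
    + d * d + d            # attention output projection + bias
    + d * 4 * d + 4 * d    # FFN up-projection (4x width) + bias
    + 4 * d * d + d        # FFN down-projection + bias
    + 4 * d                # two layer norms (scale and shift each)
)
final_layer_norm = 2 * d

total = token_embedding + position_embedding + layers * per_layer + final_layer_norm
print(f"{total:,}")        # 124,439,808 (roughly 124M)
```

Counting every weight this way gives roughly 124M, which is why the table lists both the originally published 117M and the counted ~124M for the smallest variant.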
Downstream Capabilities
- Text generation: Coherent long-form prose from both unconditional and conditional (prompted) sampling; the "unicorns" news-article sample released with the model is the best-known demonstration, and GPT-2 later powered early versions of AI Dungeon.
- Summarization: When prompted with an article followed by "TL;DR:", GPT-2 generates summary-like continuations.
- Machine translation: Given a prompt that demonstrates the task format (for example, English and French sentence pairs), GPT-2 produces rough translations zero-shot (see the prompt sketch after this list).
- Zero-shot task transfer: Because the model was trained on diverse web text, it exhibits surprising zero-shot ability on tasks it was not explicitly trained for.
- Music generation (architectural pattern): The same decoder-only architecture used in GPT-2 was applied to music generation in Music Transformer, representing notes and timing as discrete tokens.
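The zero-shot behaviours above are driven entirely by prompt formatting. A sketch using the Hugging Face `transformers` pipeline (the model name "gpt2" is the small variant; the prompt templates only loosely follow the paper's TL;DR and translation setups, and the example sentences are invented):

```python
from transformers import pipeline  # pip install transformers

generator = pipeline("text-generation", model="gpt2")

# Summarization: append "TL;DR:" and let the model continue.
article = "Researchers announced a new study on language models. ..."
summary = generator(article + "\nTL;DR:", max_new_tokens=60, do_sample=True, top_k=40)

# Translation: show the task format in the prompt, then leave the target blank.
prompt = "english: the cat sat on the mat. french:"
translation = generator(prompt, max_new_tokens=20, do_sample=True, top_k=40)

print(summary[0]["generated_text"])
print(translation[0]["generated_text"])
```

The small variant's zero-shot outputs are rough; the paper reports that these abilities scale with model size, with GPT-2 XL performing best.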