type: concept
title: "Autoregressive Generation"
tags: [language-modeling, generation, inference, decoding]
related: ["Masked Self-Attention", "GPT-2", "Decoder-Only Transformer", "Softmax Temperature"]
created: 2025-01-01
source: "https://jalammar.github.io/illustrated-gpt2/"
Autoregressive Generation
Summary
Autoregressive generation is the inference procedure by which a language model produces a sequence one token at a time, feeding each newly generated token back into the model as part of the input for the next prediction step.
How It Works
Given a starting context (which may be a single start-of-sequence token or a prompt), the model:
1. Runs a forward pass over the current input sequence.
2. Takes the output vector at the final (rightmost) position.
3. Projects it through the vocabulary projection matrix to obtain logits over all vocabulary tokens.
4. Converts the logits to a probability distribution via softmax.
5. Samples or selects a token from this distribution.
6. Appends the chosen token to the sequence.
7. Repeats from step 1 until an end-of-sequence token is produced or a maximum length is reached.
Generation continues until the context window (e.g., 1,024 tokens for GPT-2) is filled or a stop condition is met.
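The loop above can be written compactly. The following is a minimal sketch in Python/NumPy, assuming a hypothetical `model` callable that takes the current token-id sequence and returns logits for the final position; `prompt_ids`, `max_len`, and `eos_id` are likewise illustrative stand-ins, not GPT-2's actual API.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(model, prompt_ids, max_len, eos_id, rng=np.random.default_rng(0)):
    ids = list(prompt_ids)
    while len(ids) < max_len:
        logits = model(ids)            # forward pass; logits for the last position
        probs = softmax(logits)        # convert logits to a distribution
        next_id = int(rng.choice(len(probs), p=probs))  # sample one token
        ids.append(next_id)            # feed it back as context for the next step
        if next_id == eos_id:
            break
    return ids
```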
Decoding strategies: The choice of how to select a token from the probability distribution significantly affects output quality and diversity (a sampling sketch follows this list):
- Greedy (top-k = 1): Always select the highest-probability token. Tends toward repetitive loops.
- Top-k sampling: Sample from the k highest-probability tokens. GPT-2 defaults to k=40.
- Temperature scaling: Apply Softmax Temperature to sharpen or flatten the distribution before sampling.
- Nucleus (top-p) sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p.
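To make these strategies concrete, here is a hedged sketch that applies temperature scaling, top-k filtering, and nucleus (top-p) filtering to a single logits vector before sampling. The parameter values (k=40, p=0.9, temperature=0.7) are illustrative rather than recommendations; setting k=1 recovers greedy decoding.

```python
import numpy as np

def sample(logits, k=40, p=0.9, temperature=0.7, rng=np.random.default_rng(0)):
    logits = logits / temperature                  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax

    order = np.argsort(probs)[::-1]                # token ids, most probable first
    order = order[:k]                              # top-k: keep only the k best tokens

    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # top-p: smallest prefix with mass >= p
    order = order[:cutoff]

    kept = probs[order] / probs[order].sum()       # renormalize over surviving tokens
    return int(rng.choice(order, p=kept))
```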
Role in the Transformer
Autoregressive generation is the standard inference mode for decoder-only architectures such as GPT-2. The Masked Self-Attention mechanism enforces the causal constraint at training time that makes autoregressive generation valid at inference time: since each token was trained to predict only from its left context, the model's probability estimates remain consistent when used token-by-token.
At training time, the entire sequence is processed in parallel (using the causal mask to prevent information leakage), making training far more efficient than sequential RNN processing. At inference time, KV caching allows efficient sequential generation by storing previously computed key and value vectors rather than recomputing them on each step.
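As a rough illustration of the causal constraint, the sketch below applies a lower-triangular mask to the attention scores so that every position attends only to its left context; this is what allows the full sequence to be processed in parallel during training. Shapes and names are illustrative, and a real implementation would also batch across heads and reuse cached keys and values (the KV cache) during generation rather than recomputing them.

```python
import numpy as np

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d) arrays for a single attention head
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (seq_len, seq_len) attention scores
    mask = np.triu(np.ones((seq_len, seq_len)), 1)    # 1s above the diagonal = future positions
    scores = np.where(mask == 1, -np.inf, scores)     # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # each position mixes only its left context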
Variants
- Teacher forcing (training): Ground-truth tokens are always used as the previous context rather than the model's own predictions. This is the standard training procedure and differs from inference-time behavior.
- Beam search: Maintains multiple candidate sequences in parallel and selects the highest-probability complete sequence. Common in translation but less so in open-ended generation (a small sketch follows this list).
- Non-autoregressive generation: Produces all output tokens in parallel (e.g., masked diffusion models); trades off quality for speed.
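The following is a compact beam-search sketch under simplifying assumptions: a hypothetical `log_probs(ids)` function returns log-probabilities over the vocabulary for the next token, and there is no length normalization or other scoring refinement.

```python
import numpy as np

def beam_search(log_probs, prompt_ids, eos_id, beam_width=4, max_len=32):
    beams = [(list(prompt_ids), 0.0)]                 # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            lp = log_probs(ids)                       # log-distribution over the next token
            top = np.argsort(lp)[::-1][:beam_width]   # expand only the best continuations
            for tok in top:
                candidates.append((ids + [int(tok)], score + float(lp[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:beam_width]:    # keep the beam_width best candidates
            (finished if ids[-1] == eos_id else beams).append((ids, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]       # highest-scoring sequence found
```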