25 pages
Adaptive Attention Span
Adaptive Attention Span learns a per-head soft masking parameter that continuously adjusts each attention head's effective attention span, saving computation and revealing specialization across heads.
Updated May 3, 2026
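A minimal sketch of the soft-masking idea, using the clamped linear ramp as I understand it from the original paper; the learned per-head parameter z and the fixed ramp width are illustrative values here, not taken from any released implementation.

    import numpy as np

    def soft_span_mask(distances, z, ramp=32.0):
        # Soft mask: clamp((ramp + z - distance) / ramp, 0, 1), where distance is
        # how far a key lies behind the query and z is the learned per-head span.
        return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

    # A head with learned span z = 20 mostly ignores keys more than ~20 + ramp positions back.
    print(soft_span_mask(np.arange(0, 100, 10), z=20.0))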
ALiBi
ALiBi adds a fixed head-specific linear distance penalty to attention scores at every layer, enabling length extrapolation beyond the training context window without any learned positional parameters.
Updated May 3, 2026
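A sketch of the additive bias ALiBi contributes to attention scores, assuming the geometric per-head slopes (2 to the power -8i/h for head i of h) described in the paper; shapes and values are illustrative only.

    import numpy as np

    def alibi_bias(seq_len, num_heads):
        # Slopes form a geometric sequence: head i gets 2 ** (-8 * i / num_heads).
        slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
        # Distance between query position i and key position j (only j <= i matters).
        dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
        # Penalty grows linearly with distance; added to raw scores before softmax.
        return -slopes[:, None, None] * np.maximum(dist, 0)   # (heads, query, key)

    bias = alibi_bias(seq_len=8, num_heads=4)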
Attention Is All You Need
The foundational 2017 paper that introduced the Transformer architecture, replacing recurrence with attention mechanisms for sequence-to-sequence tasks.
Updated May 3, 2026
Autoregressive Generation
Autoregressive generation is the inference process whereby a language model iteratively produces one token at a time, appending each output to its input context to condition the next prediction.
Updated May 3, 2026
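A minimal greedy decoding loop illustrating the process; next_token_logits is a placeholder for any model's forward pass, not a real API.

    import numpy as np

    def next_token_logits(context):
        # Stand-in for a real model forward pass; returns random scores here.
        return np.random.randn(50257)

    def generate(context, steps):
        for _ in range(steps):
            logits = next_token_logits(context)    # score every vocabulary token
            next_id = int(np.argmax(logits))       # greedy: pick the top-scoring token
            context = context + [next_id]          # append it to condition the next step
        return context

    tokens = generate([101, 2023], steps=5)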
Byte Pair Encoding
Byte Pair Encoding is a subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent adjacent symbol pairs, producing tokens that fall between single characters and whole words; GPT-2 uses a byte-level BPE with a 50,257-token vocabulary.
Updated May 3, 2026
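A toy character-level sketch of the merge loop at the heart of BPE training, run on a tiny made-up word-frequency table; byte-level BPE as used by GPT-2 applies the same idea to UTF-8 bytes.

    from collections import Counter

    def merge_pair(symbols, pair):
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # join the chosen pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return tuple(out)

    def train_bpe(word_freqs, num_merges):
        # Each word starts as a tuple of characters; every step merges the most frequent pair.
        vocab = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
        return merges

    print(train_bpe({"lower": 5, "lowest": 2, "newer": 6}, num_merges=4))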
Compressive Transformer
Compressive Transformer adds a second compressed memory tier to Transformer-XL, using learnable compression functions and auxiliary losses to preserve salient information from distant past activations.
Updated May 3, 2026
Context Window
Context window is the maximum token sequence length a Transformer can process in one forward pass, set at training time.
Updated May 3, 2026
Decoder-Only Transformer
The decoder-only Transformer removes the encoder and cross-attention from the original architecture, leaving a stack of masked self-attention and feed-forward blocks suited to autoregressive language modeling.
Updated May 3, 2026
Embedding
A learned dense vector that maps a discrete token to a continuous representation in d_model-dimensional space.
Updated May 3, 2026
Feed-Forward Network
The Feed-Forward Network is a two-layer MLP applied independently at each sequence position within a Transformer block, expanding to 4× model dimension before projecting back, providing non-linear per-position transformation after attention.
Updated May 3, 2026
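A numpy sketch of the position-wise feed-forward block with the 4x expansion described above and the GELU activation GPT-2 uses; the random weights stand in for learned parameters.

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU, as used in GPT-2
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

    def feed_forward(x, d_model=768):
        d_ff = 4 * d_model                          # expand to 4x the model dimension
        w1 = np.random.randn(d_model, d_ff) * 0.02  # illustrative random weights
        w2 = np.random.randn(d_ff, d_model) * 0.02
        # Applied independently at every position: (seq, d_model) -> (seq, d_model)
        return gelu(x @ w1) @ w2

    out = feed_forward(np.random.randn(10, 768))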
GPT-2
GPT-2 is OpenAI's large-scale decoder-only Transformer trained on 40GB of web text for autoregressive language modeling, notable for its coherent long-form text generation and zero-shot task transfer.
Updated May 3, 2026
Key-Query-Value Projection
The three learned linear projections (Query, Key, Value) that transform input representations into the distinct roles required for scaled dot-product attention.
Updated May 3, 2026
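A sketch of the three projections applied to a sequence of hidden states; the random matrices below stand in for the learned weights.

    import numpy as np

    d_model, d_k = 768, 64
    x = np.random.randn(10, d_model)          # 10 token representations

    w_q = np.random.randn(d_model, d_k) * 0.02
    w_k = np.random.randn(d_model, d_k) * 0.02
    w_v = np.random.randn(d_model, d_k) * 0.02

    # Queries probe, keys are probed, values carry the content that gets mixed.
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # each (10, d_k)
    scores = q @ k.T / np.sqrt(d_k)           # ready for softmax and value aggregation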
kNN-Augmented Language Model
kNN-augmented language models combine a pretrained Transformer LM with nearest-neighbour retrieval over an external key-value datastore, interpolating or gating retrieved token probabilities with the model's own predictions to extend effective context far beyond the training window.
Updated May 3, 2026
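A sketch of the interpolation step only, with a fixed illustrative mixing weight lam; building and querying the key-value datastore is omitted, and the distributions here are random placeholders.

    import numpy as np

    def interpolate(p_lm, p_knn, lam=0.25):
        # Final next-token distribution mixes the parametric LM with the
        # distribution induced by retrieved nearest neighbours.
        return lam * p_knn + (1.0 - lam) * p_lm

    vocab = 50257
    p_lm = np.random.dirichlet(np.ones(vocab))    # model's own prediction (placeholder)
    p_knn = np.random.dirichlet(np.ones(vocab))   # neighbour-derived distribution (placeholder)
    p = interpolate(p_lm, p_knn)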
Layer Normalization
Layer Normalization standardizes each token's feature vector independently of batch and sequence dimensions, applied at each sub-layer of Transformer blocks to stabilize training; GPT-2 uses a pre-norm variant where normalization precedes each sub-layer.
Updated May 3, 2026
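Layer normalization over the feature dimension of each token, in numpy; the learned gain and bias present in a real Transformer are omitted here.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize each token's feature vector independently of batch and sequence.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    x = np.random.randn(4, 10, 768)    # (batch, seq, d_model)
    y = layer_norm(x)                  # every per-token vector now has mean 0, variance 1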
Logits
Raw unnormalized scores from the final linear layer of a model, converted to probabilities via softmax.
Updated May 3, 2026
Masked Self-Attention
Masked self-attention restricts each token to attend only to itself and prior positions by setting future-position scores to negative infinity before softmax, so those positions receive zero attention weight, enforcing the causal constraint required for autoregressive language modeling.
Updated May 3, 2026
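A sketch of the causal mask: strictly future positions get a score of negative infinity before softmax, so their post-softmax weight is exactly zero.

    import numpy as np

    def causal_mask(seq_len):
        # 0 on and below the diagonal, -inf strictly above it (future positions).
        return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    def masked_softmax(scores):
        scores = scores + causal_mask(scores.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        return weights / weights.sum(axis=-1, keepdims=True)

    w = masked_softmax(np.random.randn(5, 5))   # upper triangle of w is exactly 0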
Multi-Head Attention
Multi-Head Attention runs scaled dot-product attention in parallel across multiple lower-dimensional subspaces, concatenates the results, and projects back to model dimension, enabling the model to capture diverse relational patterns simultaneously.
Updated May 3, 2026
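A sketch of head splitting and recombination, assuming d_model divides evenly across heads; the input projections and the final output projection of a real implementation are omitted.

    import numpy as np

    def multi_head_attention(q, k, v, num_heads):
        seq, d_model = q.shape
        d_head = d_model // num_heads
        # Split the model dimension into per-head subspaces: (heads, seq, d_head).
        split = lambda x: x.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
        qh, kh, vh = split(q), split(k), split(v)
        scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out = weights @ vh                                     # (heads, seq, d_head)
        # Concatenate heads; a real model then applies a learned output projection.
        return out.transpose(1, 0, 2).reshape(seq, d_model)

    y = multi_head_attention(*np.random.randn(3, 16, 768), num_heads=12)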
Positional Encoding
Positional encoding injects token-order information into the otherwise order-agnostic Transformer; covers the sinusoidal, learned, relative (Shaw), Transformer-XL, RoPE, ALiBi, and DA-Transformer variants with full formulations.
Updated May 3, 2026
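For reference, a numpy sketch of the original sinusoidal variant, which pairs sines and cosines at geometrically spaced frequencies; the other variants listed above replace or supplement this scheme.

    import numpy as np

    def sinusoidal_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]             # positions 0..seq_len-1
        i = np.arange(0, d_model, 2)[None, :]         # even feature indices
        angles = pos / (10000 ** (i / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
        return pe                                     # added to the token embeddings

    pe = sinusoidal_encoding(seq_len=1024, d_model=768)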
Rotary Position Embedding
RoPE encodes position by rotating Query and Key vectors with a block-diagonal rotation matrix, ensuring attention scores depend only on relative position offset.
Updated May 3, 2026
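A sketch of applying RoPE to a single vector: consecutive dimension pairs are rotated by position-dependent angles, which is equivalent to multiplying by the block-diagonal rotation matrix mentioned above; the base constant is the conventional 10000.

    import numpy as np

    def rope(x, pos, base=10000.0):
        # x: (d,) query or key vector at absolute position `pos`; d must be even.
        d = x.shape[-1]
        freqs = base ** (-np.arange(0, d, 2) / d)     # one angle per dimension pair
        theta = pos * freqs
        cos, sin = np.cos(theta), np.sin(theta)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin               # 2-D rotation of each pair
        out[1::2] = x1 * sin + x2 * cos
        return out

    q = rope(np.random.randn(64), pos=7)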
Self-Attention
Self-attention allows each token in a sequence to attend to all other tokens by computing dot-product scores between learned Query and Key projections, then aggregating Value vectors weighted by those scores to produce context-aware token representations.
Updated May 3, 2026
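The full (unmasked) scaled dot-product attention step in numpy, given already-projected queries, keys, and values.

    import numpy as np

    def attention(q, k, v):
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)                        # all-pairs similarity
        scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ v                                     # weighted mix of value vectors

    seq, d_k = 10, 64
    out = attention(np.random.randn(seq, d_k),
                    np.random.randn(seq, d_k),
                    np.random.randn(seq, d_k))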
Softmax Temperature
A scalar that controls the sharpness of softmax distributions; in Transformers, most notably the 1/√d_k scaling in attention that prevents gradient saturation.
Updated May 3, 2026
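A sketch of temperature applied to a softmax; dividing raw dot-product scores by sqrt(d_k), as attention does, is the same operation with temperature = sqrt(d_k).

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = logits / temperature          # higher temperature -> flatter distribution
        z = z - z.max()                   # numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = np.array([2.0, 1.0, 0.1])
    print(softmax(logits, temperature=0.5))   # sharper
    print(softmax(logits, temperature=2.0))   # flatter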
Sparse Attention
Sparse attention restricts each query to a structured subset of key positions, reducing attention complexity from quadratic to sub-quadratic and enabling Transformers to process much longer sequences.
Updated May 3, 2026
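A sketch of one simple sparsity pattern, a fixed causal sliding-window mask; published variants such as the strided and fixed patterns of the Sparse Transformer combine several structured masks like this.

    import numpy as np

    def sliding_window_mask(seq_len, window):
        # Each query may attend only to itself and the `window` positions before it.
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        allowed = (j <= i) & (j >= i - window)
        return np.where(allowed, 0.0, -np.inf)   # additive mask for the score matrix

    mask = sliding_window_mask(seq_len=8, window=2)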
Transformer
The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.
Updated May 3, 2026
Transformer-XL
Transformer-XL extends the Transformer with segment-level hidden-state recurrence and relative positional encoding, enabling effective attention across multiple segments without quadratic recomputation.
Updated May 3, 2026
Universal Transformer
Universal Transformer applies a single weight-shared Transformer block recurrently across all positions for a variable number of steps, controlled by per-token adaptive halting, combining global attention with RNN-like inductive bias.
Updated May 3, 2026