Transformer Architecture

A comprehensive guide to the transformer architecture, attention mechanisms, and the key papers that shaped modern LLMs.

Public Wiki

25 pages

Adaptive Attention Span

Adaptive Attention Span trains a per-head soft mask parameter that continuously adjusts each attention head's effective context window length, saving computation and revealing specialization across heads.
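
A minimal NumPy sketch of the soft span mask described above, assuming the paper's clamp-shaped ramp; z (the learned per-head span) and the ramp length R are illustrative names:

```python
import numpy as np

def span_mask(distance, z, ramp=32):
    """Soft mask m_z(x) = clamp((ramp + z - x) / ramp, 0, 1) over key distance x."""
    return np.clip((ramp + z - distance) / ramp, 0.0, 1.0)

# Attention weights are multiplied by this mask and re-normalized, so keys farther
# than roughly z positions behind the query contribute nothing and can be skipped.
```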

efficiency
attention
adaptive

Updated May 3, 2026

ALiBi

ALiBi adds a fixed head-specific linear distance penalty to attention scores at every layer, enabling length extrapolation beyond the training context window without any learned positional parameters.
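
A minimal NumPy sketch of the penalty, using the paper's geometric slope schedule for a power-of-two head count (names are illustrative):

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Per-head linear distance penalties added to raw attention scores."""
    # Geometric slope schedule from the paper: 2^(-8/n), 2^(-16/n), ..., for n heads.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = how many positions key j lies behind query i (0 on the diagonal).
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    # Penalty grows linearly with distance; shape (num_heads, seq_len, seq_len).
    return -slopes[:, None, None] * np.maximum(distance, 0)

# Added to scores before the softmax, e.g. scores[h] += alibi_bias(n_heads, seq_len)[h].
```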

position
encoding
attention

Updated May 3, 2026

Attention Is All You Need

The foundational 2017 paper that introduced the Transformer architecture, replacing recurrence with attention mechanisms for sequence-to-sequence tasks.

transformer
attention
seq2seq

Updated May 3, 2026

Autoregressive Generation

Autoregressive generation is the inference process whereby a language model iteratively produces one token at a time, appending each output to its input context to condition the next prediction.
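
A minimal greedy-decoding sketch of that loop, assuming a hypothetical next_token_logits(context) function that returns a vocabulary-sized score vector (not any particular library's API):

```python
import numpy as np

def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive decoding: each prediction is appended to the context."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)   # scores over the vocabulary
        next_id = int(np.argmax(logits))      # greedy choice; sampling also works here
        context.append(next_id)               # feed the new token back in
        if eos_id is not None and next_id == eos_id:
            break
    return context
```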

language-modeling
generation
inference

Updated May 3, 2026

Byte Pair Encoding

Byte Pair Encoding is a subword tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs, producing tokens that are typically sub-word units; GPT-2 uses a byte-level BPE with a 50,257-token vocabulary.
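
A toy sketch of the merge loop on character tuples (simplified; GPT-2's actual tokenizer works on bytes and caches merge ranks):

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dictionary."""
    vocab = {tuple(w): c for w, c in word_counts.items()}   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                     # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# bpe_train({"low": 5, "lower": 2, "newest": 6}, num_merges=10)
```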

tokenization
vocabulary
subword

Updated May 3, 2026

Compressive Transformer

Compressive Transformer adds a second compressed memory tier to Transformer-XL, using learnable compression functions and auxiliary losses to preserve salient information from distant past activations.

long-context
memory
compression

Updated May 3, 2026

Context Window

Context window is the maximum token sequence length a Transformer can process in one forward pass, set at training time.

inference
sequence-length

Updated May 3, 2026

Decoder-Only Transformer

The decoder-only Transformer removes the encoder and cross-attention from the original architecture, leaving a stack of masked-self-attention and feed-forward blocks suited for autoregressive language modeling.
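
A minimal sketch of one pre-norm block in the stack; attn, ffn, and the two norms stand in for the mechanisms described in the Masked Self-Attention, Feed-Forward Network, and Layer Normalization entries:

```python
def decoder_block(x, attn, ffn, norm1, norm2):
    """One pre-norm decoder block: x + MaskedAttn(LN(x)), then x + FFN(LN(x))."""
    x = x + attn(norm1(x))   # masked self-attention sub-layer with residual connection
    x = x + ffn(norm2(x))    # position-wise feed-forward sub-layer with residual connection
    return x

# A decoder-only LM is roughly: token embeddings (+ positions) -> N stacked
# decoder_blocks -> final layer norm -> linear projection to vocabulary logits.
```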

architecture-pattern
decoder
autoregressive

Updated May 3, 2026

Embedding

A learned dense vector that maps a discrete token to a continuous representation in d_model-dimensional space.

glossary
representation
input

Updated May 3, 2026

Feed-Forward Network

The Feed-Forward Network is a two-layer MLP applied independently at each sequence position within a Transformer block, typically expanding to 4× the model dimension before projecting back, providing a non-linear per-position transformation after attention.
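
A minimal NumPy sketch; GELU is shown since GPT-2 uses it, and the 4× expansion lives in the weight shapes (names are the usual convention, not a specific implementation):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, w1, b1, w2, b2):
    """x: (seq_len, d_model); w1: (d_model, 4*d_model); w2: (4*d_model, d_model)."""
    return gelu(x @ w1 + b1) @ w2 + b2   # the same MLP is applied at every position
```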

component
transformer
mlp

Updated May 3, 2026

GPT-2

GPT-2 is OpenAI's large-scale decoder-only Transformer trained on 40GB of web text for autoregressive language modeling, notable for its coherent long-form text generation and zero-shot task transfer.

decoder-only
language-modeling
autoregressive

Updated May 3, 2026

Key-Query-Value Projection

The three learned linear projections (Query, Key, Value) that transform input representations into the distinct roles required for scaled dot-product attention.
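
A minimal single-head sketch (the weights would normally be learned; shapes are illustrative):

```python
import numpy as np

def qkv_project(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); each weight: (d_model, d_k) -> Q, K, V."""
    return x @ w_q, x @ w_k, x @ w_v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                        # 10 tokens, d_model = 512
w_q, w_k, w_v = (rng.normal(size=(512, 64)) for _ in range(3))
q, k, v = qkv_project(x, w_q, w_k, w_v)               # each (10, 64)
```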

attention
transformer
component

Updated May 3, 2026

kNN-Augmented Language Model

kNN-augmented language models combine a pretrained Transformer LM with nearest-neighbour retrieval over an external key-value datastore, interpolating or gating retrieved token probabilities with the model's own predictions to extend effective context far beyond the training window.
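
A minimal sketch of the interpolation step with brute-force search (real systems use approximate nearest-neighbour indexes; names and the lambda value are illustrative):

```python
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, k=8, lam=0.25):
    """Interpolate the LM's distribution with one built from retrieved neighbours.

    query: (d,) current context representation
    keys:  (N, d) datastore keys; values: (N,) next-token ids stored alongside them
    p_lm:  (V,) the model's own next-token distribution
    """
    dists = np.sum((keys - query) ** 2, axis=-1)          # squared L2 distances
    nn = np.argsort(dists)[:k]                            # k nearest neighbours
    weights = np.exp(-dists[nn]); weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nn], weights)                 # aggregate weight per token id
    return lam * p_knn + (1 - lam) * p_lm
```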

memory
retrieval
long-context

Updated May 3, 2026

Layer Normalization

Layer Normalization standardizes each token's feature vector independently of batch and sequence dimensions, applied at each sub-layer of Transformer blocks to stabilize training; GPT-2 uses a pre-norm variant where normalization precedes each sub-layer.
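
A minimal NumPy sketch (gamma, beta, and eps follow the usual naming):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector over its last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Pre-norm (GPT-2 style) applies it before each sub-layer:
#   x = x + attention(layer_norm(x));  x = x + ffn(layer_norm(x))
```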

normalization
training-stability
transformer

Updated May 3, 2026

Logits

Raw unnormalized scores from the final linear layer of a model, converted to probabilities via softmax.

glossary
output
classification

Updated May 3, 2026

Masked Self-Attention

Masked self-attention restricts each token to attend only to itself and prior positions by setting future-position scores to negative infinity before the softmax (so their attention weights become zero), enforcing the causal constraint required for autoregressive language modeling.
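
A minimal single-head NumPy sketch of the causal mask, assuming Q, K, V are already projected:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(q, k, v):
    """q, k, v: (seq_len, d_k). Future positions get -inf scores, hence zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    causal = np.tril(np.ones(scores.shape, dtype=bool))   # True at and below the diagonal
    scores = np.where(causal, scores, -np.inf)            # block attention to the future
    return softmax(scores) @ v
```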

attention
decoder
autoregressive

Updated May 3, 2026

Multi-Head Attention

Multi-Head Attention runs scaled dot-product attention in parallel across multiple lower-dimensional subspaces, concatenates the results, and projects back to model dimension, enabling the model to capture diverse relational patterns simultaneously.
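
A minimal NumPy sketch of the split-attend-concatenate-project pattern (single sequence, no mask; assumes d_model is divisible by the head count):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq, d_model); each weight: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads
    split = lambda t: t.reshape(seq, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (heads, seq, seq)
    heads = softmax(scores) @ v                                # each head attends in its own subspace
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)    # concatenate the heads
    return concat @ w_o                                        # project back to d_model
```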

attention
component
transformer

Updated May 3, 2026

Positional Encoding

Positional encoding injects token-order information into the otherwise order-agnostic attention mechanism; this page covers the sinusoidal, learned, relative (Shaw), Transformer-XL, RoPE, ALiBi, and DA-Transformer variants with full formulations.
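
A minimal NumPy sketch of the original sinusoidal variant (added to the input embeddings; the relative, RoPE, and ALiBi variants listed above instead act on the attention computation itself):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...); even d_model assumed."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the token embeddings at the input layer
```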

attention
position
encoding

Updated May 3, 2026

Rotary Position Embedding

RoPE encodes position by rotating Query and Key vectors with a block-diagonal rotation matrix, ensuring attention scores depend only on relative position offset.
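
A minimal NumPy sketch of the rotation on one vector, using the paper's frequency schedule and interleaved dimension pairs (some implementations pair dimensions differently):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x (shape (d,)) by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # theta_i = base^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # interleaved pairs of dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2x2 rotation applied to each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Applied to Q and K (not V) before the dot product, so
# rope_rotate(q, m) @ rope_rotate(k, n) depends only on the offset m - n.
```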

position
encoding
attention

Updated May 3, 2026

Self-Attention

Self-attention allows each token in a sequence to attend to all other tokens by computing dot-product scores between learned Query and Key projections, then aggregating Value vectors weighted by those scores to produce context-aware token representations.
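
A minimal single-head NumPy sketch of scaled dot-product attention (unmasked; see Masked Self-Attention for the causal variant):

```python
import numpy as np

def self_attention(q, k, v):
    """q, k, v: (seq_len, d_k) projections of the same input sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # pairwise similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ v                                # weighted sum of value vectors
```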

attention
transformer
core-mechanism

Updated May 3, 2026

Softmax Temperature

A scalar that controls the sharpness of softmax distributions; in Transformers, most notably the 1/√d_k scaling of attention scores, which keeps the softmax from saturating (and its gradients from vanishing) as d_k grows.
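
A minimal sketch of the effect (the values are illustrative):

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature                      # lower T -> sharper, higher T -> flatter
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits, temperature=0.5))      # peaked: most mass on the top logit
print(softmax(logits, temperature=2.0))      # flatter: closer to uniform
# Scaled dot-product attention divides scores by sqrt(d_k), i.e. a fixed temperature of sqrt(d_k).
```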

attention
softmax
hyperparameter

Updated May 3, 2026

Sparse Attention

Sparse attention restricts each query to a structured subset of key positions, reducing attention complexity from quadratic to sub-quadratic and enabling Transformers to process much longer sequences.
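
A minimal sketch of one common pattern, a causal sliding-window mask; strided, block, and global-token patterns follow the same idea:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: query i may attend only to the `window` keys at or just behind it."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)        # causal and local

# Used like the causal mask: scores = np.where(mask, scores, -np.inf) before the softmax.
# Each query touches at most `window` keys, so cost grows as O(seq_len * window)
# instead of O(seq_len ** 2).
```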

efficiency
attention
long-context

Updated May 3, 2026

Transformer

The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.

transformer
encoder-decoder
attention

Updated May 3, 2026

Transformer-XL

Transformer-XL extends the Transformer with segment-level hidden-state recurrence and relative positional encoding, enabling effective attention across multiple segments without quadratic recomputation.
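
A minimal sketch of the segment-level recurrence for one attention layer (the relative positional terms and the causal mask are omitted for brevity; during training the cached memory is held fixed with no gradient):

```python
import numpy as np

def attend_with_memory(h, memory, w_q, w_k, w_v):
    """h: current segment (seq, d); memory: cached hidden states from the previous segment (mem, d)."""
    context = np.concatenate([memory, h], axis=0)     # keys/values span memory + new segment
    q = h @ w_q                                        # queries come only from the new segment
    k, v = context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (seq, mem + seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                 # memory for the next segment is this layer's h
```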

long-context
recurrence
relative-position

Updated May 3, 2026

Universal Transformer

Universal Transformer applies a single weight-shared Transformer block recurrently across all positions for a variable number of steps, controlled by per-token adaptive halting, combining global attention with RNN-like inductive bias.

recurrence
adaptive-computation
transformer-variant

Updated May 3, 2026