Transformer Architecture

A comprehensive guide to the transformer architecture, attention mechanisms, and the key papers that shaped modern LLMs.

Public Wiki

25 pages

Adaptive Attention Span

Adaptive Attention Span trains a per-head soft mask parameter that continuously adjusts each attention head's effective context window length, saving computation and revealing specialization across heads.
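
A minimal NumPy sketch of the soft span mask described above, assuming the paper's clamp-shaped ramp; z (the learned per-head span) and the ramp length R are illustrative names:

```python
import numpy as np

def span_mask(distance, z, ramp=32):
    """Soft mask m_z(x) = clamp((ramp + z - x) / ramp, 0, 1) over key distance x."""
    return np.clip((ramp + z - distance) / ramp, 0.0, 1.0)

# Attention weights are multiplied by this mask and re-normalized, so keys farther
# than roughly z positions behind the query contribute nothing and can be skipped.
```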

efficiency
attention
adaptive

Updated May 3, 2026

ALiBi

ALiBi adds a fixed head-specific linear distance penalty to attention scores at every layer, enabling length extrapolation beyond the training context window without any learned positional parameters.
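
A minimal NumPy sketch of the penalty, using the paper's geometric slope schedule for a power-of-two head count (names are illustrative):

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Per-head linear distance penalties added to raw attention scores."""
    # Geometric slope schedule from the paper: 2^(-8/n), 2^(-16/n), ..., for n heads.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = how many positions key j lies behind query i (0 on the diagonal).
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    # Penalty grows linearly with distance; shape (num_heads, seq_len, seq_len).
    return -slopes[:, None, None] * np.maximum(distance, 0)

# Added to scores before the softmax, e.g. scores[h] += alibi_bias(n_heads, seq_len)[h].
```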

position
encoding
attention

Updated May 3, 2026

Attention Is All You Need

The foundational 2017 paper that introduced the Transformer architecture, replacing recurrence with attention mechanisms for sequence-to-sequence tasks.

transformer
attention
seq2seq

Updated May 3, 2026

Autoregressive Generation

Autoregressive generation is the inference process whereby a language model iteratively produces one token at a time, appending each output to its input context to condition the next prediction.
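
A minimal greedy-decoding sketch of that loop, assuming a hypothetical next_token_logits(context) function that returns a vocabulary-sized score vector (not any particular library's API):

```python
import numpy as np

def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive decoding: each prediction is appended to the context."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)   # scores over the vocabulary
        next_id = int(np.argmax(logits))      # greedy choice; sampling also works here
        context.append(next_id)               # feed the new token back in
        if eos_id is not None and next_id == eos_id:
            break
    return context
```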

language-modeling
generation
inference

Updated May 3, 2026

Byte Pair Encoding

Byte Pair Encoding is a subword tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs, producing tokens that are typically sub-word units; GPT-2 uses a byte-level BPE with a 50,257-token vocabulary.
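
A toy sketch of the merge loop on character tuples (simplified; GPT-2's actual tokenizer works on bytes and caches merge ranks):

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dictionary."""
    vocab = {tuple(w): c for w, c in word_counts.items()}   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                     # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# bpe_train({"low": 5, "lower": 2, "newest": 6}, num_merges=10)
```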

tokenization
vocabulary
subword

Updated May 3, 2026

Compressive Transformer

Compressive Transformer adds a second compressed memory tier to Transformer-XL, using learnable compression functions and auxiliary losses to preserve salient information from distant past activations.

long-context
memory
compression

Updated May 3, 2026

Context Window

Context window is the maximum token sequence length a Transformer can process in one forward pass, set at training time.

inference
sequence-length

Updated May 3, 2026

Decoder-Only Transformer

The decoder-only Transformer removes the encoder and cross-attention from the original architecture, leaving a stack of masked-self-attention and feed-forward blocks suited for autoregressive language modeling.
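
A minimal sketch of one pre-norm block in the stack; attn, ffn, and the two norms stand in for the mechanisms described in the Masked Self-Attention, Feed-Forward Network, and Layer Normalization entries:

```python
def decoder_block(x, attn, ffn, norm1, norm2):
    """One pre-norm decoder block: x + MaskedAttn(LN(x)), then x + FFN(LN(x))."""
    x = x + attn(norm1(x))   # masked self-attention sub-layer with residual connection
    x = x + ffn(norm2(x))    # position-wise feed-forward sub-layer with residual connection
    return x

# A decoder-only LM is roughly: token embeddings (+ positions) -> N stacked
# decoder_blocks -> final layer norm -> linear projection to vocabulary logits.
```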

architecture-pattern
decoder
autoregressive

Updated May 3, 2026

Embedding

A learned dense vector that maps a discrete token to a continuous representation in d_model-dimensional space.

glossary
representation
input

Updated May 3, 2026

Feed-Forward Network

The Feed-Forward Network is a two-layer MLP applied independently at each sequence position within a Transformer block, typically expanding to 4× the model dimension before projecting back, providing a non-linear per-position transformation after attention.
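
A minimal NumPy sketch; GELU is shown since GPT-2 uses it, and the 4× expansion lives in the weight shapes (names are the usual convention, not a specific implementation):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, as used in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, w1, b1, w2, b2):
    """x: (seq_len, d_model); w1: (d_model, 4*d_model); w2: (4*d_model, d_model)."""
    return gelu(x @ w1 + b1) @ w2 + b2   # the same MLP is applied at every position
```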

component
transformer
mlp

Updated May 3, 2026

GPT-2

GPT-2 is OpenAI's large-scale decoder-only Transformer trained on 40GB of web text for autoregressive language modeling, notable for its coherent long-form text generation and zero-shot task transfer.

decoder-only
language-modeling
autoregressive

Updated May 3, 2026

Key-Query-Value Projection

The three learned linear projections (Query, Key, Value) that transform input representations into the distinct roles required for scaled dot-product attention.
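
A minimal single-head sketch (the weights would normally be learned; shapes are illustrative):

```python
import numpy as np

def qkv_project(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); each weight: (d_model, d_k) -> Q, K, V."""
    return x @ w_q, x @ w_k, x @ w_v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                        # 10 tokens, d_model = 512
w_q, w_k, w_v = (rng.normal(size=(512, 64)) for _ in range(3))
q, k, v = qkv_project(x, w_q, w_k, w_v)               # each (10, 64)
```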

attention
transformer
component

Updated May 3, 2026

kNN-Augmented Language Model

kNN-augmented language models combine a pretrained Transformer LM with nearest-neighbour retrieval over an external key-value datastore, interpolating or gating retrieved token probabilities with the model's own predictions to extend effective context far beyond the training window.
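
A minimal sketch of the interpolation step with brute-force search (real systems use approximate nearest-neighbour indexes; names and the lambda value are illustrative):

```python
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, k=8, lam=0.25):
    """Interpolate the LM's distribution with one built from retrieved neighbours.

    query: (d,) current context representation
    keys:  (N, d) datastore keys; values: (N,) next-token ids stored alongside them
    p_lm:  (V,) the model's own next-token distribution
    """
    dists = np.sum((keys - query) ** 2, axis=-1)          # squared L2 distances
    nn = np.argsort(dists)[:k]                            # k nearest neighbours
    weights = np.exp(-dists[nn]); weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nn], weights)                 # aggregate weight per token id
    return lam * p_knn + (1 - lam) * p_lm
```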

memory
retrieval
long-context

Updated May 3, 2026

Layer Normalization

Layer Normalization standardizes each token's feature vector independently of batch and sequence dimensions, applied at each sub-layer of Transformer blocks to stabilize training; GPT-2 uses a pre-norm variant where normalization precedes each sub-layer.
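
A minimal NumPy sketch (gamma, beta, and eps follow the usual naming):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector over its last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Pre-norm (GPT-2 style) applies it before each sub-layer:
#   x = x + attention(layer_norm(x));  x = x + ffn(layer_norm(x))
```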

normalization
training-stability
transformer

Updated May 3, 2026

Logits

Raw unnormalized scores from the final linear layer of a model, converted to probabilities via softmax.

glossary
output
classification

Updated May 3, 2026

Masked Self-Attention

Masked self-attention restricts each token to attend only to itself and prior positions by setting future-position scores to negative infinity before the softmax (so their attention weights become zero), enforcing the causal constraint required for autoregressive language modeling.
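
A minimal single-head NumPy sketch of the causal mask, assuming Q, K, V are already projected:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(q, k, v):
    """q, k, v: (seq_len, d_k). Future positions get -inf scores, hence zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    causal = np.tril(np.ones(scores.shape, dtype=bool))   # True at and below the diagonal
    scores = np.where(causal, scores, -np.inf)            # block attention to the future
    return softmax(scores) @ v
```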

attention
decoder
autoregressive

Updated May 3, 2026

Multi-Head Attention

Multi-Head Attention runs scaled dot-product attention in parallel across multiple lower-dimensional subspaces, concatenates the results, and projects back to model dimension, enabling the model to capture diverse relational patterns simultaneously.
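
A minimal NumPy sketch of the split-attend-concatenate-project pattern (single sequence, no mask; assumes d_model is divisible by the head count):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq, d_model); each weight: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads
    split = lambda t: t.reshape(seq, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (heads, seq, seq)
    heads = softmax(scores) @ v                                # each head attends in its own subspace
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)    # concatenate the heads
    return concat @ w_o                                        # project back to d_model
```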

attention
component
transformer

Updated May 3, 2026

Positional Encoding

Positional encoding injects token-order information into the otherwise order-agnostic attention mechanism; this page covers the sinusoidal, learned, relative (Shaw), Transformer-XL, RoPE, ALiBi, and DA-Transformer variants with full formulations.
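
A minimal NumPy sketch of the original sinusoidal variant (added to the input embeddings; the relative, RoPE, and ALiBi variants listed above instead act on the attention computation itself):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...); even d_model assumed."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the token embeddings at the input layer
```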

attention
position
encoding

Updated May 3, 2026

Rotary Position Embedding

RoPE encodes position by rotating Query and Key vectors with a block-diagonal rotation matrix, ensuring attention scores depend only on relative position offset.
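
A minimal NumPy sketch of the rotation on one vector, using the paper's frequency schedule and interleaved dimension pairs (some implementations pair dimensions differently):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x (shape (d,)) by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # theta_i = base^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # interleaved pairs of dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2x2 rotation applied to each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Applied to Q and K (not V) before the dot product, so
# rope_rotate(q, m) @ rope_rotate(k, n) depends only on the offset m - n.
```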

position
encoding
attention

Updated May 3, 2026

Self-Attention

Self-attention allows each token in a sequence to attend to all other tokens by computing dot-product scores between learned Query and Key projections, then aggregating Value vectors weighted by those scores to produce context-aware token representations.
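
A minimal single-head NumPy sketch of scaled dot-product attention (unmasked; see Masked Self-Attention for the causal variant):

```python
import numpy as np

def self_attention(q, k, v):
    """q, k, v: (seq_len, d_k) projections of the same input sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # pairwise similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ v                                # weighted sum of value vectors
```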

attention
transformer
core-mechanism

Updated May 3, 2026

Softmax Temperature

A scalar that controls the sharpness of softmax distributions; in Transformers, most notably the 1/√d_k scaling of attention scores, which keeps the softmax from saturating (and its gradients from vanishing) as d_k grows.
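
A minimal sketch of the effect (the values are illustrative):

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature                      # lower T -> sharper, higher T -> flatter
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits, temperature=0.5))      # peaked: most mass on the top logit
print(softmax(logits, temperature=2.0))      # flatter: closer to uniform
# Scaled dot-product attention divides scores by sqrt(d_k), i.e. a fixed temperature of sqrt(d_k).
```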

attention
softmax
hyperparameter

Updated May 3, 2026

Sparse Attention

Sparse attention restricts each query to a structured subset of key positions, reducing attention complexity from quadratic to sub-quadratic and enabling Transformers to process much longer sequences.
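
A minimal sketch of one common pattern, a causal sliding-window mask; strided, block, and global-token patterns follow the same idea:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: query i may attend only to the `window` keys at or just behind it."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)        # causal and local

# Used like the causal mask: scores = np.where(mask, scores, -np.inf) before the softmax.
# Each query touches at most `window` keys, so cost grows as O(seq_len * window)
# instead of O(seq_len ** 2).
```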

efficiency
attention
long-context

Updated May 3, 2026

Transformer

The original encoder-decoder architecture introduced in 2017 that replaced recurrence with self-attention, becoming the foundation of modern LLMs.

transformer
encoder-decoder
attention

Updated May 3, 2026

Transformer-XL

Transformer-XL extends the Transformer with segment-level hidden-state recurrence and relative positional encoding, enabling effective attention across multiple segments without quadratic recomputation.
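
A minimal sketch of the segment-level recurrence for one attention layer (the relative positional terms and the causal mask are omitted for brevity; during training the cached memory is held fixed with no gradient):

```python
import numpy as np

def attend_with_memory(h, memory, w_q, w_k, w_v):
    """h: current segment (seq, d); memory: cached hidden states from the previous segment (mem, d)."""
    context = np.concatenate([memory, h], axis=0)     # keys/values span memory + new segment
    q = h @ w_q                                        # queries come only from the new segment
    k, v = context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (seq, mem + seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                 # memory for the next segment is this layer's h
```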

long-context
recurrence
relative-position

Updated May 3, 2026

Universal Transformer

Universal Transformer applies a single weight-shared Transformer block recurrently across all positions for a variable number of steps, controlled by per-token adaptive halting, combining global attention with RNN-like inductive bias.

recurrence
adaptive-computation
transformer-variant

Updated May 3, 2026