type: concept
title: "Byte Pair Encoding"
tags: [tokenization, vocabulary, subword]
related: ["GPT-2", "Embedding"]
created: 2025-01-01
source: "https://jalammar.github.io/illustrated-gpt2/"

Byte Pair Encoding

Summary

Byte Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent adjacent byte or character pairs in a corpus to build a vocabulary of variable-length tokens, balancing vocabulary size against coverage of rare and unseen words.

How It Works

Starting from a base vocabulary of individual characters (or bytes), BPE:

  1. Counts all adjacent symbol pairs in the training corpus.
  2. Merges the most frequent pair into a new symbol.
  3. Repeats until the desired vocabulary size is reached.

The result is a vocabulary where common words appear as single tokens, common subwords (prefixes, suffixes, stems) appear as tokens, and rare or novel words are decomposed into smaller units. This means tokens are typically parts of words rather than whole words.
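
To make the merge loop concrete, here is a minimal training sketch in Python; the toy corpus, merge count, and helper names are illustrative choices, not taken from the source.

    import re
    from collections import Counter

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs across all words, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with its merged symbol."""
        # Lookarounds ensure we only merge whole symbols, not fragments of
        # previously merged ones.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    # Toy corpus: each word as space-separated characters, with its frequency.
    vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

    num_merges = 10  # in practice: repeat until the target vocabulary size is reached
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(best)  # on this toy corpus: ('e', 's') first, then ('es', 't'), ...

Each printed pair becomes a new vocabulary entry, which is how frequent subwords like "est" end up as single tokens.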

GPT-2 uses BPE at the byte level, so any UTF-8 text can be encoded without an "unknown" token. Its vocabulary has 50,257 entries: 256 base bytes, 50,000 learned merges, and a special end-of-text token.
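
As a quick way to see the byte-level property in practice, the sketch below assumes the tiktoken package (not mentioned in the source) and round-trips text containing non-Latin characters through the GPT-2 encoding.

    import tiktoken  # pip install tiktoken; an assumed dependency, not from the source

    enc = tiktoken.get_encoding("gpt2")
    print(enc.n_vocab)  # 50257

    # Byte-level BPE: arbitrary UTF-8 text round-trips with no <unk> token.
    text = "Tokenizers ♥ 日本語"
    ids = enc.encode(text)
    print(ids)  # rare characters fall back to multiple byte-level tokens
    assert enc.decode(ids) == text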

Role in the Transformer

BPE tokenization is applied as a pre-processing step before any Transformer computation. Each token in the resulting sequence is mapped to a dense Embedding vector via the model's embedding matrix. The choice of tokenization strategy directly affects vocabulary size, sequence length for a given input, and the model's ability to handle morphologically rich or multilingual text.
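
A minimal sketch of that lookup, assuming NumPy; the matrix is randomly initialized and the token ids are hypothetical, though the shapes match GPT-2 small (50,257 x 768).

    import numpy as np

    vocab_size, d_model = 50257, 768  # GPT-2 small's vocabulary and embedding width
    embedding_matrix = np.random.randn(vocab_size, d_model).astype(np.float32)

    token_ids = [464, 6766, 318]      # hypothetical BPE token ids for some input
    x = embedding_matrix[token_ids]   # shape (3, 768): one dense vector per token
    print(x.shape)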

Variants

  • WordPiece: Used by BERT; similar merge-based approach but uses a likelihood criterion rather than raw frequency for merges.
  • SentencePiece: A language-agnostic subword tokenizer that can implement BPE or unigram language model tokenization directly on raw text without pre-tokenization.
  • Unigram Language Model tokenization: Used in some T5 variants; probabilistically selects the most likely segmentation.
  • Character-level tokenization: Treats each character as a token; avoids out-of-vocabulary issues but produces very long sequences (see the sketch below).
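
To illustrate the sequence-length trade-off from the last bullet, a trivial character-level split in Python; the subword segmentation in the comment is a hypothetical example, not the output of any particular tokenizer.

    # Character-level tokenization: every character is its own token.
    text = "unbelievably"
    char_tokens = list(text)  # 12 tokens for a single word
    print(char_tokens)

    # A subword scheme might cover the same word in far fewer tokens,
    # e.g. ["un", "believ", "ably"] (hypothetical segmentation).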

Key Papers