type: paper
title: "Attention Is All You Need"
authors: ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Lukasz Kaiser", "Illia Polosukhin"]
year: 2017
venue: "NeurIPS"
tags: ["transformer", "attention", "seq2seq", "machine-translation", "foundational"]
url: "https://arxiv.org/abs/1706.03762"
created: 2025-01-01

Attention Is All You Need

One-Line Summary

Introduced the Transformer architecture, replacing recurrence and convolution entirely with attention mechanisms for sequence-to-sequence tasks.

Abstract (Paraphrased)

Prior state-of-the-art sequence transduction models relied on complex recurrent or convolutional neural networks with encoder-decoder structures. This paper proposed the Transformer, a model architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The model achieved superior translation quality while being significantly more parallelizable and requiring less time to train.

Core Contributions

  1. Introduced the Transformer architecture: an encoder-decoder model built entirely from Self-Attention and feed-forward layers.
  2. Introduced Multi-Head Attention as a mechanism for attending to information from different representation subspaces simultaneously.
  3. Introduced scaled dot-product attention, dividing dot products by the square root of the key dimension to stabilize gradients.
  4. Proposed sinusoidal Positional Encoding to inject sequence order information into the model without recurrence.
  5. Demonstrated that the Transformer outperformed existing best models on WMT 2014 English-to-German and English-to-French translation benchmarks.
  6. Showed that self-attention layers connect all positions with a constant number of sequential operations (O(1)), compared to the O(n) sequential operations required by recurrent layers.

Architecture / Method

The Transformer follows an encoder-decoder structure. The encoder is a stack of six identical layers, each containing two sub-layers: a Multi-Head Attention sub-layer and a position-wise Feed-Forward Network sub-layer. Each sub-layer is wrapped with a residual connection followed by Layer Normalization.
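
A minimal NumPy sketch of this sub-layer wrapping (assuming a simplified Layer Normalization without the learned gain and bias; the sub-layer here is a placeholder, not the paper's attention or feed-forward network):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Simplified Layer Normalization over the last dimension (learned gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """The wrapping described above: LayerNorm(x + Sublayer(x)), applied around each sub-layer."""
    return layer_norm(x + sublayer(x))

# Example with a toy, shape-preserving sub-layer (a ReLU stands in for attention / feed-forward).
x = np.random.randn(10, 512)
print(residual_sublayer(x, lambda t: np.maximum(0.0, t)).shape)  # (10, 512)
```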

The decoder is also a stack of six identical layers, but adds a third sub-layer: a cross-attention layer (Encoder-Decoder Attention) that attends over the encoder's output. The decoder's self-attention sub-layer is masked to prevent positions from attending to subsequent positions, enforcing the autoregressive property.
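
A small illustrative sketch of that mask (a hypothetical helper, not code from the paper): entries above the diagonal are set to negative infinity, so after the softmax those future positions receive zero attention weight.

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: 0 on and below the diagonal, -inf above it (future positions)."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```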

Attention is formulated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
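
A direct NumPy transcription of this formula (a sketch; the optional additive mask corresponds to the decoder masking described above):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v);
    mask, if given, is added to the scores (e.g. -inf at disallowed positions).
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask
    return softmax(scores, axis=-1) @ V
```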

Multi-Head Attention runs this computation $h$ times in parallel with different learned projections, concatenates the results, and projects them back to the model dimension. The paper uses $h = 8$ heads and $d_k = d_v = 64$, with model dimension $d_{\text{model}} = 512$.
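
Continuing the sketch above (it reuses `scaled_dot_product_attention`), a minimal multi-head self-attention with the paper's $h = 8$, $d_k = d_v = 64$, and $d_{\text{model}} = 512$; the randomly initialized projection matrices stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

# Random placeholders for the learned per-head projections and the output projection.
W_q = rng.normal(0, 0.02, (h, d_model, d_k))
W_k = rng.normal(0, 0.02, (h, d_model, d_k))
W_v = rng.normal(0, 0.02, (h, d_model, d_v))
W_o = rng.normal(0, 0.02, (h * d_v, d_model))

def multi_head_self_attention(x, mask=None):
    """Run attention h times with different projections, concatenate the heads, project back."""
    heads = []
    for i in range(h):
        Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]
        heads.append(scaled_dot_product_attention(Q, K, V, mask))
    return np.concatenate(heads, axis=-1) @ W_o

x = rng.normal(size=(10, d_model))            # toy sequence of 10 positions
print(multi_head_self_attention(x).shape)     # (10, 512)
```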

Positional Encoding uses sine and cosine functions of different frequencies, added directly to the input embeddings at the bottom of both the encoder and decoder stacks.
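
A sketch of that encoding (even dimensions get sine, odd dimensions cosine, with wavelengths forming a geometric progression; `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2), the 2i values
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the input embeddings at the bottom of the encoder and decoder stacks:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```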

Key Results

  • WMT 2014 English-to-German: 28.4 BLEU, surpassing all previously published models including ensembles.
  • WMT 2014 English-to-French: 41.0 BLEU (single model), outperforming all previous single models at a fraction of the training cost.
  • Trained the big model in 3.5 days on 8 P100 GPUs, substantially faster than prior recurrent architectures.

Influence

The Transformer became the dominant architecture for NLP and beyond. It directly enabled BERT, GPT, and T5, and is the foundation of virtually all modern large language models. The encoder-only, decoder-only, and encoder-decoder architectural variants all descend from this paper.

Limitations Acknowledged

  • Attention complexity is quadratic in sequence length, limiting applicability to very long sequences.
  • The paper focused on machine translation; generalization to other tasks was suggested but not fully explored.
  • Positional encoding scheme is one of several possible choices; learned encodings were noted as a viable alternative.

Notes

  • The division by $\sqrt{d_k}$ in scaled dot-product attention is a key practical detail: without it, the dot products grow large in magnitude for large $d_k$, pushing the softmax into regions with extremely small gradients (a short numeric sketch follows this list).
  • The paper stacks exactly six encoder and decoder layers, but acknowledges this is a hyperparameter.
  • Jay Alammar's "The Illustrated Transformer" (https://jalammar.github.io/illustrated-transformer/) is widely considered the most accessible visual explanation of this paper.
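
A short numeric sketch of that effect: with a large $d_k$, unscaled dot products of random unit-variance vectors have variance around $d_k$, so the softmax collapses toward a one-hot distribution, while the $\sqrt{d_k}$ scaling keeps it soft (the numbers below are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512  # deliberately large to make the effect visible

q = rng.normal(size=d_k)
keys = rng.normal(size=(10, d_k))
scores = keys @ q  # variance ~ d_k, so magnitudes are large

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(scores).round(3))                 # unscaled: close to one-hot, gradients nearly vanish
print(softmax(scores / np.sqrt(d_k)).round(3))  # scaled: a much softer distribution
```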