Attention Is All You Need, the 2017 paper
This course is a close reading of Vaswani et al. (2017), Attention Is All You Need, the paper that introduced the encoder–decoder transformer for machine translation. We focus on what the paper actually proposed: self-attention, multi-head attention, cross-attention, and the two-tower architecture in which an encoder reads the source sequence and a decoder generates the target while attending to encoder memory.
For the modern decoder-only world (GPT-2, GPT-3, the chat models you use every day), see the companion course Build and Train Your Own GPT-2, which builds a decoder-only transformer end-to-end in PyTorch.
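As a preview of the mechanism the lessons build toward, here is a minimal sketch of the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. The function names `softmax` and `attention` are illustrative choices, not taken from the paper or the course. The sketch also leaves out the learned projection matrices, masking, and multiple heads, so treat it as a reading aid and not a reference implementation.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    queries: list of vectors, each of length d_k
    keys:    list of vectors, each of length d_k
    values:  list of vectors, one per key
    Returns one output vector per query.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

In this simplified form, self-attention is `attention(x, x, x)` on a single sequence `x`, while cross-attention is `attention(decoder_states, encoder_memory, encoder_memory)`: the queries come from the decoder and the keys and values come from the encoder output.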
Breakdown: Attention Is All You Need
Prerequisites
Deep Sequence Modelling — RNN
Recurrence, hidden state, BPTT, sequence-to-sequence
Deep Neural Networks
Layers, backpropagation, softmax, matrix multiplication
Lessons
Problem with RNNs and LSTMs
RNNs are serial, slow, and bottleneck distant context
Positional embeddings & encodings
Add position vectors so attention knows sequence order
Attention
A query asks what to read; keys decide which positions match and values supply the content
Self-attention
Every token attends to every other in a single parallel pass
Multi-head attention
Run h parallel heads; each learns different relationship patterns
Cross-attention
Decoder queries attend to the encoder output: keys and values come from the source sequence
The encoder stack
N identical layers of bidirectional self-attention plus a feed-forward network (FFN); every token sees full context
Encoder–decoder transformer
Both stacks together: the complete 2017 architecture
From encoder–decoder to GPT
BERT, GPT, T5, and why decoder-only won — bridges into miniGPT
Unlocks
Build and Train Your Own GPT-2
Take the decoder-only design and train a language model end-to-end in PyTorch
Vision Transformers (ViT)
Apply the same architecture to image patches
Multimodal Models
Combine vision and language with cross-attention between towers