Attention Is All You Need, the 2017 paper

A close reading of Vaswani et al. (2017): the encoder–decoder transformer built for machine translation. Covers self-attention, multi-head attention, cross-attention, and the two-tower architecture that later models such as BERT, GPT, and T5 build on.

Updated 24 days ago

This course is a close reading of Vaswani et al. (2017), Attention Is All You Need, the paper that introduced the encoder–decoder transformer for machine translation. We focus on what the paper actually proposed: self-attention, multi-head attention, cross-attention, and the two-tower architecture in which an encoder reads the source sequence and a decoder generates the target while attending to encoder memory.
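
To make that vocabulary concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, the operation underneath both self-attention and cross-attention. It is an illustration under simplifying assumptions, not the paper's full model: the learned projection matrices, multiple heads, and masking are all omitted, and the helper names (softmax, attention) and the toy vectors are invented for this example.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors.

    Each query is compared with every key; the resulting weights
    mix the value vectors into one output vector per query.
    """
    d_k = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


# Toy data (hypothetical): 3 source tokens and 2 target tokens, dimension 2.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.5, 0.5], [1.0, 0.0]]

# Self-attention: queries, keys, and values all come from one sequence.
self_attended = attention(source, source, source)

# Cross-attention: queries from the target side, keys and values from the source.
cross_attended = attention(target, source, source)
```

The only difference between the two calls is where the queries come from: in self-attention all three inputs are the same sequence, while in cross-attention the output has one row per target token, each a weighted mix of source-side values.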

For the modern decoder-only world (GPT-2, GPT-3, the chat models you use every day), see the companion course Build and Train Your Own GPT-2, which builds a decoder-only transformer end-to-end in PyTorch.

Breakdown: Attention Is All You Need

Course map

Prerequisites

Deep Sequence Modelling — RNN

Recurrence, hidden state, BPTT, sequence-to-sequence

Deep Neural Networks

Layers, backpropagation, softmax, matrix multiplication

Lessons

01 · Intermediate

Problem with RNNs and LSTMs

RNNs process tokens serially, train slowly, and squeeze distant context through a fixed-size hidden state

02 · Intermediate

Positional embeddings & encodings

Add position vectors so attention knows sequence order

03 · Intermediate

Attention

A query asks what to read; keys decide which positions match, and values supply the content

04 · Intermediate

Self-attention

Every token attends to every other token in a single parallel pass

05 · Intermediate

Multi-headed attention

Run h attention heads in parallel; each can learn a different relationship pattern

06 · Intermediate

Cross-attention

The decoder queries the encoder: queries come from the target sequence, keys and values from the source

07 · Intermediate

The encoder stack

N layers of bidirectional self-attention and feed-forward networks (FFN), so every position sees the full input

08 · Advanced

Encoder–decoder transformer

Both stacks together: the complete 2017 architecture

09 · Intermediate

From encoder–decoder to GPT

BERT, GPT, T5, and why decoder-only won — bridges into miniGPT

Unlocks

Build and Train Your Own GPT-2

Take the decoder-only design and train a language model end-to-end in PyTorch

Vision Transformers (ViT)

Apply the same architecture to image patches

Multimodal Models

Combine vision and language with cross-attention between towers

Test your understanding

Prof will ask you questions about Attention Is All You Need, the 2017 paper, rather than explain it. You may be surprised by what you don't know until you have to say it.