This course walks through every component of GPT-2 (embeddings, causal attention, residual connections, layer normalization, and feed-forward blocks) and assembles them into a decoder-only transformer trained end-to-end.
GPT-2: The Model That Was Too Dangerous to Release
Prerequisites
Attention Is All You Need
All transformer components: embeddings, attention, layer norm, FFN, encoder/decoder stacks
Deep Sequence Modelling — RNN
Sequence generation, next-token prediction, loss over time steps
Lessons
Problem with RNNs and LSTMs
RNNs process tokens one at a time, so training cannot be parallelised and long-range dependencies fade
Token embeddings
Vocabulary IDs → learned d_model-dimensional vectors (768 in GPT-2 small)
Positional embeddings
Add learned position vectors (GPT-2 does not use sinusoidal codes) so attention knows token order
Attention & multi-head attention
12 parallel heads, each 64-dimensional, learning different patterns
Causal masking
Upper-triangular mask hides future tokens, so each position attends only to itself and earlier positions
Residual connections
Add each sub-layer's input to its output so gradients flow through 12 stacked blocks
Layer normalization
Pre-norm: normalise each token's vector before every sub-layer to stabilise activations through 12 layers
Feed-forward neural networks
Expand 4× to 3072, apply GELU, contract back to 768
Generation of next tokens
Sequence → logits → sample: greedy, top-k, top-p, temperature
Decoder-only transformer
Stack 12 masked-attention + FFN blocks — no encoder needed
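To make the lesson list concrete, here is a minimal NumPy sketch of how the pieces above could fit together in a forward pass and a sampling loop. It is an illustration under stated assumptions, not the course's implementation: the toy sizes, the random untrained weights, and the helper names (`init`, `new_block`, `forward`, `sample_next`, `generate`) are all hypothetical, and biases and the learned layer-norm gain and bias are omitted for brevity.

```python
import numpy as np

# Toy sizes (assumed) so the sketch runs instantly.
# GPT-2 small would use D_MODEL=768, N_HEAD=12, N_LAYER=12.
VOCAB, N_CTX, D_MODEL, N_HEAD, N_LAYER = 50, 16, 24, 4, 2
rng = np.random.default_rng(0)

def init(*shape):
    # Random, untrained weights; a real model would learn these.
    return rng.normal(0.0, 0.02, size=shape)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector on its own (learned gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, p):
    T, D = x.shape
    hd = D // N_HEAD  # per-head width (64 in GPT-2 small)
    q, k, v = np.split(x @ p["w_qkv"], 3, axis=-1)
    heads = lambda a: a.reshape(T, N_HEAD, hd).transpose(1, 0, 2)  # (H, T, hd)
    q, k, v = heads(q), heads(k), heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)  # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly upper triangle
    scores = np.where(future, -1e9, scores)  # hide future positions
    out = (softmax(scores) @ v).transpose(1, 0, 2).reshape(T, D)
    return out @ p["w_out"]

def block(x, p):
    x = x + causal_attention(layer_norm(x), p)  # pre-norm + residual
    hidden = gelu(layer_norm(x) @ p["w_up"])    # expand 4x, apply GELU
    return x + hidden @ p["w_down"]             # contract back + residual

def new_block():
    return {"w_qkv": init(D_MODEL, 3 * D_MODEL), "w_out": init(D_MODEL, D_MODEL),
            "w_up": init(D_MODEL, 4 * D_MODEL), "w_down": init(4 * D_MODEL, D_MODEL)}

params = {"wte": init(VOCAB, D_MODEL),  # token embeddings
          "wpe": init(N_CTX, D_MODEL),  # learned positional embeddings
          "blocks": [new_block() for _ in range(N_LAYER)]}

def forward(token_ids):
    ids = np.asarray(token_ids)
    x = params["wte"][ids] + params["wpe"][: len(ids)]
    for p in params["blocks"]:
        x = block(x, p)
    # Final layer norm, then project with the tied token-embedding matrix.
    return layer_norm(x) @ params["wte"].T  # (T, VOCAB)

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature
    if top_k is not None:  # keep only the k largest logits (top_k=1 is greedy)
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = softmax(logits)
    if top_p is not None:  # keep the smallest set whose mass reaches top_p
        order = np.argsort(-probs)
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
        kept = np.zeros_like(probs)
        kept[order[:cutoff]] = probs[order[:cutoff]]
        probs = kept / kept.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(token_ids, n_new, **sampling):
    ids = list(token_ids)
    for _ in range(n_new):
        logits = forward(ids[-N_CTX:])  # crop to the context window
        ids.append(sample_next(logits[-1], **sampling))
    return ids

print(generate([1, 2, 3], n_new=5, top_k=1))
```

Because the weights are random, the generated IDs are meaningless; the point is the data flow. Changing a later token in the input leaves the logits at earlier positions unchanged, which is a quick way to check that the causal mask is doing its job.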
Unlocks
Fine-tuning GPT-2 on custom data
Adapt pretrained weights to your domain or task with SFT
RLHF & alignment
Reward modelling and PPO to align generation with human preferences
Inference optimisation & deployment
KV-cache, quantisation, and serving LLMs at scale