Train a 2-layer MLP on XOR
Background
This is the capstone of the build-nn series: wire every function you have built (linear_forward, relu, sigmoid, mse_loss, their backward passes, and sgd_step) into a complete training loop, and train a tiny MLP on XOR. XOR is the classic minimal example of a problem that is not linearly separable: a single linear layer cannot solve it, but a 2-layer MLP with a non-linearity in the middle can. If your loop drives the loss down on XOR, that is strong evidence the components work together: gradients flow, the optimiser mutates parameters in place, and the loss decreases.
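To make the "a single linear layer cannot solve it" claim concrete, here is a small sketch that fits the best possible linear model to XOR with least squares. It uses only NumPy; the variable names (X_aug, theta, linear_pred) are illustrative and not part of the series.

```python
import numpy as np

# The four XOR examples and their targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Append a column of ones so the bias is fitted together with the weights.
X_aug = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of X_aug @ theta ~= Y, i.e. the best linear fit.
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
linear_pred = X_aug @ theta

print(np.round(theta.ravel(), 6))        # both weights near 0, bias near 0.5
print(np.round(linear_pred.ravel(), 6))  # every prediction near 0.5
```

The best linear fit ignores both inputs and predicts 0.5 for every example, which is why the hidden non-linearity is needed.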
Problem statement
Implement the training-loop body of train_xor(seed, hidden, lr, steps). The architecture and weight init are provided; you write forward → backward → update, repeated steps times, then return the final MSE.
forward:   z1 = linear_forward(X, W1, b1);  a1 = relu(z1)
           z2 = linear_forward(a1, W2, b2); y = sigmoid(z2)
backward:  dy = mse_loss_backward(y, Y)
           dz2 = sigmoid_backward(dy, y)    # pass y (the OUTPUT), not z2
           da1, dW2, db2 = linear_backward(dz2, a1, W2)
           dz1 = relu_backward(da1, z1)
           _, dW1, db1 = linear_backward(dz1, X, W1)
update:    sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)
Train full-batch on all 4 XOR examples each step. With the defaults (lr=0.5, steps=5000) the final MSE should land below 0.1.
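For orientation, the loop can be sketched end to end as below. This is a sketch under stated assumptions, not the series' reference solution: the helper bodies are hypothetical stand-ins for the functions from the earlier problems, and the weight init (standard-normal weights, zero biases) is a guess, since the provided init is not shown here. Whether this particular init reaches the 0.1 threshold on a given seed is not guaranteed.

```python
import numpy as np

# Hypothetical stand-ins for the build-nn helpers; the series' versions may differ.
def linear_forward(x, W, b):
    return x @ W + b

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(y, t):
    return float(np.mean((y - t) ** 2))

def mse_loss_backward(y, t):
    # Gradient of mean((y - t)^2) with respect to y.
    return 2.0 * (y - t) / y.size

def sigmoid_backward(dy, y):
    # Takes the sigmoid OUTPUT y, since sigma'(z) = y * (1 - y).
    return dy * y * (1.0 - y)

def relu_backward(da, z):
    # Gradient passes only where the pre-activation was positive.
    return da * (z > 0)

def linear_backward(dz, x, W):
    # Returns (dx, dW, db) for z = x @ W + b.
    return dz @ W.T, x.T @ dz, dz.sum(axis=0)

def sgd_step(params, grads, lr):
    for p, g in zip(params, grads):
        p -= lr * g  # in-place, so the caller's arrays change

def train_xor(seed=0, hidden=4, lr=0.5, steps=5000):
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    Y = np.array([[0.0], [1.0], [1.0], [0.0]])

    # ASSUMED init: the real one is provided by the problem and may differ.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(size=(hidden, 1))
    b2 = np.zeros(1)

    for _ in range(steps):
        # forward
        z1 = linear_forward(X, W1, b1)
        a1 = relu(z1)
        z2 = linear_forward(a1, W2, b2)
        y = sigmoid(z2)
        # backward
        dy = mse_loss_backward(y, Y)
        dz2 = sigmoid_backward(dy, y)
        da1, dW2, db2 = linear_backward(dz2, a1, W2)
        dz1 = relu_backward(da1, z1)
        _, dW1, db1 = linear_backward(dz1, X, W1)
        # update
        sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)

    # One last forward pass so the returned loss reflects the final weights.
    y = sigmoid(linear_forward(relu(linear_forward(X, W1, b1)), W2, b2))
    return mse_loss(y, Y)
```

Calling train_xor(seed=0) runs the full 5000 updates and returns one number; the final forward pass after the loop is a design choice so that the reported loss matches the parameters that were actually returned by the last update.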
Input
- seed (int): RNG seed for weight init (np.random.default_rng(seed)).
- hidden (int): hidden-layer width (default 4).
- lr (float): SGD learning rate (default 0.5).
- steps (int): number of full-batch updates (default 5000).
Output
Returns a Python float: the final MSE loss on the 4 XOR examples after training.
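A NumPy reduction such as np.mean returns a NumPy scalar rather than a built-in float, so an explicit conversion is the safe way to meet this requirement. The sketch below uses made-up network outputs purely for illustration; whether the hidden test checks the exact type is an assumption.

```python
import numpy as np

y = np.array([[0.1], [0.9], [0.8], [0.2]])  # made-up outputs, for illustration only
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

raw = np.mean((y - Y) ** 2)   # NumPy scalar (np.float64)
final_loss = float(raw)       # built-in Python float

print(type(raw).__name__, type(final_loss).__name__)
```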
Examples
Example 1 — one iteration of the loop
z1 = X@W1 + b1; a1 = relu(z1); z2 = a1@W2 + b2; y = sigmoid(z2)
dy = mse_loss_backward(y, Y)
dz2 = sigmoid_backward(dy, y) # output y, not pre-activation z2
da1, dW2, db2 = linear_backward(dz2, a1, W2)
dz1 = relu_backward(da1, z1)
_, dW1, db1 = linear_backward(dz1, X, W1)
sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)
Explanation: each backward call returns the gradient with respect to its own input (the previous layer's activation), so the chain threads from the loss back to the first layer; sgd_step then updates all four parameters in place. Repeat for steps iterations.
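One way to check that the backward chain is threaded correctly is a finite-difference gradient check. The sketch below writes the forward and backward passes inline instead of calling the series helpers, and the names loss_fn, analytic_grads, and numeric_grad are hypothetical. Biases are drawn at random here (an assumption made only for the check) so that no ReLU pre-activation sits exactly at zero, where the numerical estimate is unreliable.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

def loss_fn(W1, b1, W2, b2):
    # Linear -> ReLU -> Linear -> Sigmoid -> MSE, written inline.
    a1 = np.maximum(X @ W1 + b1, 0.0)
    y = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
    return np.mean((y - Y) ** 2)

def analytic_grads(W1, b1, W2, b2):
    # Same backward order as the example above.
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)
    y = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
    dy = 2.0 * (y - Y) / y.size
    dz2 = dy * y * (1.0 - y)
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numeric_grad(params, index, eps=1e-6):
    # Central differences on one parameter array, one entry at a time.
    p = params[index]
    grad = np.zeros_like(p)
    for i in np.ndindex(p.shape):
        old = p[i]
        p[i] = old + eps
        plus = loss_fn(*params)
        p[i] = old - eps
        minus = loss_fn(*params)
        p[i] = old  # restore the original value
        grad[i] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 4)), rng.normal(size=4),
          rng.normal(size=(4, 1)), rng.normal(size=1)]

max_err = max(
    np.max(np.abs(a - numeric_grad(params, k)))
    for k, a in enumerate(analytic_grads(*params))
)
print(max_err)  # should be tiny if the chain is wired correctly
```

If one gradient disagrees while the others match, the mismatch usually points at the backward call that produced it.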
Example 2 — the loss converges
train_xor(seed=0, steps=20) -> loss ≈ 0.25 (barely trained)
train_xor(seed=0, steps=5000) -> loss < 0.1 (XOR solved)
Explanation: a loss stuck at ~0.25 means the model is outputting ~0.5 for everything (the not-yet-separated state). The loss only drops below that once the hidden ReLU layer learns a non-linear split; reaching < 0.1 confirms the whole chain is wired correctly.
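The 0.25 figure follows directly from the MSE definition, as the short sketch below shows. The helper name mse and the sample outputs are illustrative only.

```python
import numpy as np

Y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets

def mse(y):
    return float(np.mean((np.asarray(y) - Y) ** 2))

# A model that outputs 0.5 everywhere is off by 0.5 on every example.
undecided = mse([0.5, 0.5, 0.5, 0.5])

# A model that is off by 0.2 on every example is well under the 0.1 threshold.
separated = mse([0.2, 0.8, 0.8, 0.2])

print(undecided, separated)  # 0.25 and roughly 0.04
```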
Constraints
- Architecture: Linear(2→hidden) → ReLU → Linear(hidden→1) → Sigmoid, MSE loss, full-batch each step.
- Backward order: mse_loss_backward → sigmoid_backward → linear_backward (layer 2) → relu_backward → linear_backward (layer 1).
- sigmoid_backward's second argument is the sigmoid output y, not the pre-activation z2.
- sgd_step updates in place; return the final loss as a Python float.
- Must reach loss < 0.1 in 5000 steps at lr=0.5, and converge on most seeds.
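The rule about sigmoid_backward's second argument matters because the derivative is usually computed from the output as y * (1 - y). The sketch below assumes that form (the series' own implementation may differ) and shows what happens when z2 is passed by mistake.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_backward(dy, y):
    # ASSUMED form: expects the sigmoid output y and uses sigma'(z) = y * (1 - y).
    return dy * y * (1.0 - y)

z2 = np.array([0.0, 2.0])
y = sigmoid(z2)
dy = np.ones_like(z2)  # upstream gradient of 1 to isolate the local derivative

right = sigmoid_backward(dy, y)    # about [0.25, 0.105]
wrong = sigmoid_backward(dy, z2)   # [0.0, -2.0]

print(right, wrong)
```

With z2 passed in, the gradient vanishes at z2 = 0 and flips sign at z2 = 2, so the update no longer follows the true slope of the loss.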
Notes
- The 0.25 plateau. If the loss flatlines near 0.25, it almost always means the gradient is not flowing. Most often sigmoid_backward was passed z2 instead of y, or a rebind inside sgd_step broke the in-place update.
- Series finale. This composes problems 01–06 of build-nn. If they are all correct, the only work here is ordering the backward chain; re-derive each grad_* shape on paper if the loss won't drop.
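The rebind failure mentioned in the notes is easy to reproduce in isolation. In the sketch below, sgd_step_rebind and sgd_step_inplace are hypothetical names for a broken and a working variant; only the second one changes the caller's array.

```python
import numpy as np

def sgd_step_rebind(params, grads, lr):
    for p, g in zip(params, grads):
        p = p - lr * g   # creates a new array and rebinds the local name only

def sgd_step_inplace(params, grads, lr):
    for p, g in zip(params, grads):
        p -= lr * g      # writes into the array the caller holds

W = np.ones(3)
g = np.full(3, 2.0)

sgd_step_rebind([W], [g], lr=0.5)
after_rebind = W.copy()    # unchanged: [1. 1. 1.]

sgd_step_inplace([W], [g], lr=0.5)
after_inplace = W.copy()   # updated: [0. 0. 0.]

print(after_rebind, after_inplace)
```

With the rebind variant the weights never move, so the loss stays wherever the initial parameters put it.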
This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Returns a Python float
- Loss decreases over training
- Reaches loss < 0.1 in 5000 steps (the diagnostic: the model actually learned XOR)
- Works across multiple seeds (not seed-dependent)
- More hidden units → can still solve it (sanity)