Train a 2-layer MLP on XOR

Difficulty: Hard

Background

This is the capstone of the build-nn series: wire every function you have built (linear_forward, relu, sigmoid, mse_loss, the three backward functions, and sgd_step) into a complete training loop, and train a tiny 2 → 4 → 1 MLP on XOR. XOR is the smallest non-linearly-separable problem: a single linear layer cannot solve it, but a 2-layer MLP with a non-linearity in the middle can. If your loop drives the loss down on XOR, that is strong evidence that every component is correct: gradients flow, the optimiser mutates in place, and the loss decreases.
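
To make the "a single linear layer cannot solve it" claim concrete, here is a minimal sketch that fits the best possible linear model to XOR by least squares. It assumes the usual 0/1 encoding of the four XOR examples; the names A, coef and linear_mse are illustrative and not part of the problem. A linear layer followed by a sigmoid has the same limitation, since its decision boundary is still a straight line.

```python
import numpy as np

# The four XOR examples and their targets (assumed 0/1 encoding).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

# Best linear model y = X @ w + b, fitted by least squares.
A = np.hstack([X, np.ones((4, 1))])            # append a bias column
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
pred = A @ coef

linear_mse = float(np.mean((pred - Y) ** 2))
print(pred.ravel())   # every prediction is (numerically) 0.5
print(linear_mse)     # about 0.25
```

The best linear fit predicts 0.5 for every input, which gives an MSE of 0.25.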

Problem statement

Implement the training-loop body of train_xor(seed, hidden, lr, steps). The architecture and weight init are provided; you write forward → backward → update, repeated steps times, then return the final MSE.

forward:   z1 = linear_forward(X, W1, b1);  a1 = relu(z1)
           z2 = linear_forward(a1, W2, b2);  y = sigmoid(z2)
backward:  dy  = mse_loss_backward(y, Y)
           dz2 = sigmoid_backward(dy, y)              # pass y (the OUTPUT), not z2
           da1, dW2, db2 = linear_backward(dz2, a1, W2)
           dz1 = relu_backward(da1, z1)
           _,   dW1, db1 = linear_backward(dz1, X, W1)
update:    sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)

Train full-batch on all 4 XOR examples each step. With the defaults (lr=0.5, steps=5000) the final MSE should land below 0.1.
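
The following is a sketch of a single full-batch step under stated assumptions, not the reference solution. The helper bodies are stand-ins whose signatures are inferred from the calls above; the mse_loss(y, Y) call, the mean-over-examples loss convention, and the weight init are assumptions, since the real problem supplies its own.

```python
import numpy as np

# Stand-in helpers (assumed behaviour, inferred from how they are called).
def linear_forward(x, W, b):
    return x @ W + b

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(y, t):
    return float(np.mean((y - t) ** 2))

def mse_loss_backward(y, t):
    return 2.0 * (y - t) / y.size

def sigmoid_backward(dy, y):
    return dy * y * (1.0 - y)                    # uses the sigmoid OUTPUT y

def relu_backward(da, z):
    return da * (z > 0)

def linear_backward(dz, x, W):
    return dz @ W.T, x.T @ dz, dz.sum(axis=0)    # dx, dW, db

def sgd_step(params, grads, lr):
    for p, g in zip(params, grads):
        p -= lr * g                              # in-place update

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

# Hypothetical init for illustration only.
rng = np.random.default_rng(0)
hidden, lr = 4, 0.5
W1 = rng.standard_normal((2, hidden)); b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, 1)); b2 = np.zeros(1)

def step():
    """One forward -> backward -> update pass over all 4 examples."""
    z1 = linear_forward(X, W1, b1); a1 = relu(z1)
    z2 = linear_forward(a1, W2, b2); y = sigmoid(z2)
    loss = mse_loss(y, Y)
    dy = mse_loss_backward(y, Y)
    dz2 = sigmoid_backward(dy, y)
    da1, dW2, db2 = linear_backward(dz2, a1, W2)
    dz1 = relu_backward(da1, z1)
    _, dW1, db1 = linear_backward(dz1, X, W1)
    sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)
    return loss, (dW1, db1, dW2, db2)

loss0, grads = step()
print(loss0, [g.shape for g in grads])
```

Each gradient has the same shape as the parameter it updates, which is a quick first check before looking at the loss curve.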

Input

  • seed (int): RNG seed for weight init (np.random.default_rng(seed)).
  • hidden (int): hidden-layer width (default 4).
  • lr (float): SGD learning rate (default 0.5).
  • steps (int): number of full-batch updates (default 5000).

Output

Returns a python float: the final MSE loss on the 4 XOR examples after training.
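
A small illustration of the return-type requirement, assuming the loss is computed with NumPy: np.mean returns a NumPy scalar, and wrapping it in float() yields the built-in type. The predictions below are made up for the example.

```python
import numpy as np

y = np.array([[0.1], [0.9], [0.8], [0.2]])   # made-up predictions, illustration only
Y = np.array([[0.], [1.], [1.], [0.]])

loss_np = np.mean((y - Y) ** 2)   # NumPy scalar
loss = float(loss_np)             # built-in Python float

print(type(loss_np).__name__, type(loss).__name__)
```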

Examples

Example 1 — one iteration of the loop

z1 = X@W1 + b1;  a1 = relu(z1);  z2 = a1@W2 + b2;  y = sigmoid(z2)
dy  = mse_loss_backward(y, Y)
dz2 = sigmoid_backward(dy, y)               # output y, not pre-activation z2
da1, dW2, db2 = linear_backward(dz2, a1, W2)
dz1 = relu_backward(da1, z1)
_,   dW1, db1 = linear_backward(dz1, X, W1)
sgd_step([W1, b1, W2, b2], [dW1, db1, dW2, db2], lr)

Explanation: each backward call returns the gradient at the previous activation, threading from the loss back to the first layer; sgd_step then updates all four parameters in place. Repeat for steps iterations.
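
Written out as equations, the chain looks as follows. This is a sketch that assumes the row-vector convention z = XW + b used in the forward pass and an MSE averaged over the N = 4 examples; if your mse_loss_backward uses a different scaling, only the first line changes.

```latex
\begin{aligned}
dy   &= \frac{\partial L}{\partial y} = \frac{2}{N}\,(y - Y) \\
dz_2 &= dy \odot y \odot (1 - y) \\
dW_2 &= a_1^{\top}\, dz_2, \qquad db_2 = \sum_{i} dz_2^{(i)}, \qquad da_1 = dz_2\, W_2^{\top} \\
dz_1 &= da_1 \odot \mathbf{1}[z_1 > 0] \\
dW_1 &= X^{\top}\, dz_1, \qquad db_1 = \sum_{i} dz_1^{(i)}
\end{aligned}
```

Here \odot is the element-wise product and the sums run over the batch dimension.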

Example 2 — the loss converges

train_xor(seed=0, steps=20)   -> loss ≈ 0.25   (barely trained)
train_xor(seed=0, steps=5000) -> loss  < 0.1   (XOR solved)

Explanation: a loss stuck at ~0.25 means the model is outputting ~0.5 for everything (the not-yet-separated state). The loss only drops below that once the hidden ReLU layer learns a non-linear split; reaching < 0.1 confirms the whole chain is wired correctly.
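
The 0.25 figure can be checked by hand. The sketch below assumes the loss is averaged over the four examples; it also shows one plausible reason the plateau is sticky: at a constant output of 0.5 the gradients at the output pre-activation cancel across the batch.

```python
import numpy as np

Y = np.array([[0.], [1.], [1.], [0.]])

# A network that has not separated anything yet outputs sigmoid(0) = 0.5 everywhere.
y = np.full_like(Y, 0.5)
plateau_mse = float(np.mean((y - Y) ** 2))

# Gradient reaching the output pre-activation at that point
# (assuming a mean-over-examples MSE).
dy = 2.0 * (y - Y) / y.size
dz2 = dy * y * (1.0 - y)

print(plateau_mse)   # 0.25
print(dz2.sum())     # 0.0: the output-bias gradient vanishes here
```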

Constraints

  • Architecture: Linear(2→hidden) → ReLU → Linear(hidden→1) → Sigmoid, MSE loss, full-batch each step.
  • Backward order: mse_loss_backward → sigmoid_backward → linear_backward(layer 2) → relu_backward → linear_backward(layer 1).
  • sigmoid_backward's second argument is the sigmoid output y, not the pre-activation z2.
  • sgd_step updates in place; return the final loss as a python float.
  • Must reach loss < 0.1 in 5000 steps at lr=0.5, and converge on most seeds.
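
The in-place requirement is easy to get wrong. The sketch below contrasts two hypothetical variants of sgd_step (the names sgd_step_inplace and sgd_step_rebind are made up for this example): one mutates the caller's arrays, the other silently leaves them untouched.

```python
import numpy as np

def sgd_step_inplace(params, grads, lr):
    for p, g in zip(params, grads):
        p -= lr * g        # mutates the caller's array

def sgd_step_rebind(params, grads, lr):
    for p, g in zip(params, grads):
        p = p - lr * g     # BUG: builds a new array and rebinds the local name only

W_good = np.ones((2, 2))
W_bad = np.ones((2, 2))
g = np.full((2, 2), 0.5)

sgd_step_inplace([W_good], [g], lr=0.5)
sgd_step_rebind([W_bad], [g], lr=0.5)

print(W_good[0, 0])  # 0.75
print(W_bad[0, 0])   # 1.0, unchanged
```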

Notes

  • The 0.25 plateau. If the loss flatlines near 0.25 it almost always means the gradient does not flow — most often sigmoid_backward was passed z2 instead of y, or an sgd_step rebind broke the in-place update.
  • Series finale. This composes problems 01–06 of build-nn. If they are all correct, the only work here is ordering the backward chain — re-derive each grad_* shape on paper if the loss won't drop.
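
When re-deriving shapes on paper is not enough, a finite-difference gradient check can locate a broken link in the chain. This is a sketch under assumptions: the forward and backward passes are written inline with the semantics the helpers are assumed to have, the function names and the use_output flag are hypothetical, and the random biases are chosen only so that no pre-activation sits on the ReLU kink.

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

def loss_and_grads(W1, b1, W2, b2, use_output=True):
    """Forward and backward pass written out inline (assumed helper semantics)."""
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)
    z2 = a1 @ W2 + b2
    y = 1.0 / (1.0 + np.exp(-z2))
    loss = float(np.mean((y - Y) ** 2))
    dy = 2.0 * (y - Y) / y.size
    s = y if use_output else z2          # use_output=False reproduces the z2 mistake
    dz2 = dy * s * (1.0 - s)
    dW2, db2, da1 = a1.T @ dz2, dz2.sum(axis=0), dz2 @ W2.T
    dz1 = da1 * (z1 > 0)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

def max_grad_error(params, use_output=True, eps=1e-6):
    """Largest gap between analytic and central-difference gradients."""
    _, grads = loss_and_grads(*params, use_output=use_output)
    worst = 0.0
    for p, g in zip(params, grads):
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus, _ = loss_and_grads(*params)
            p[idx] = old - eps
            minus, _ = loss_and_grads(*params)
            p[idx] = old                 # restore the original value
            worst = max(worst, abs((plus - minus) / (2 * eps) - g[idx]))
    return worst

rng = np.random.default_rng(0)
# Random biases keep pre-activations away from the ReLU kink at exactly 0.
params = [rng.standard_normal((2, 4)), rng.standard_normal(4),
          rng.standard_normal((4, 1)), rng.standard_normal(1)]

good_err = max_grad_error(params)
bad_err = max_grad_error(params, use_output=False)
print(good_err)   # tiny when the chain is right
print(bad_err)    # much larger when z2 is passed instead of y
```

The same check can be pointed at your own helpers by replacing the inline passes with calls to them.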

This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue:

  • Returns a python float
  • Loss decreases over training
  • Reaches loss < 0.1 in 5000 steps (the diagnostic — model actually learned XOR)
  • Works across multiple seeds (not seed-dependent)
  • More hidden units → can still solve it (sanity)