Linear backward (chain rule)
Background
Backpropagation through a linear layer is the most-asked backprop derivation in interviews — and the one most candidates fumble. It is pure chain rule applied to the matmul , and it yields three gradients (one each for the input, the weight, and the bias). The reassuring part: once you fix the shapes, there is exactly one way to multiply the matrices that lines up, so the dimensions force the right answer.
Problem statement
Implement linear_backward(grad_out, x, W). The forward pass (from build-nn-01) was . Given the upstream gradient of shape (B, out), return (dx, dW, db):
The bias gradient sums across the batch because the same bias is added to every row in the forward pass.
Input
grad_out—np.ndarrayof shape(B, out_features): the gradient of the loss w.r.t. the output .x—np.ndarrayof shape(B, in_features): the input that was fed to the forward pass.W—np.ndarrayof shape(in_features, out_features): the weight matrix from the forward pass.
Output
Returns a tuple (dx, dW, db) with shapes (B, in), (in, out), and (out,) respectively.
Examples
Example 1 — the three gradients, hand-checked ()
Input: x = [[1, 2]], W = [[1, 0, 1],
[0, 1, 1]], grad_out = [[1, 1, 1]]
Output: dx = [[2, 2]]
dW = [[1, 1, 1],
[2, 2, 2]]
db = [1, 1, 1]
Explanation: ; ; sums over the (single) batch row.
Example 2 — the bias gradient sums over the batch
Input: grad_out = [[1, 2],
[3, 4]] (B=2, out=2)
Output: db = [1+3, 2+4] = [4, 6]
Explanation: each output coordinate's bias contributed to both batch rows, so its gradient is the column sum of grad_out.
Constraints
dx = grad_out @ W.T→(B, in);dW = x.T @ grad_out→(in, out);db = grad_out.sum(axis=0)→(out,).- The bias gradient is summed over
axis=0(the batch), not averaged. - Shapes must match exactly:
grad_outis(B, out),Wis(in, out)— onlygrad_out @ W.Tyields(B, in). - The diagnostic finite-difference cross-check is on
dW(the trickiest of the three), compared withatol≈1e-3.
Notes
- Shapes force the algebra. With
grad_out (B,out),W (in,out),x (B,in), there is exactly one dimension-consistent product for each gradient — a reliable sanity check when you forget which operand to transpose. - Series. This is the backward for the Linear forward layer; together with the ReLU/sigmoid and MSE backwards, it completes the pieces needed to train the XOR MLP.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Returns three arrays with correct shapes
- •dx matches grad_out @ W.T
- •dW matches x.T @ grad_out
- •db sums grad_out over the batch dimension
- •dW matches finite-difference numerical gradient
- •Larger batch shape: gradients still correct