Linear backward (chain rule)Medium

Linear backward (chain rule)

Background

Backpropagation through a linear layer is the most-asked backprop derivation in interviews — and the one most candidates fumble. It is pure chain rule applied to the matmul y=xW+by = xW + b, and it yields three gradients (one each for the input, the weight, and the bias). The reassuring part: once you fix the shapes, there is exactly one way to multiply the matrices that lines up, so the dimensions force the right answer.

Problem statement

Implement linear_backward(grad_out, x, W). The forward pass (from build-nn-01) was y=xW+by = xW + b. Given the upstream gradient g=L/yg = \partial L/\partial y of shape (B, out), return (dx, dW, db):

dx=gW    (B,in),dW=xg    (in,out),db=b=1Bgb,:    (out,)dx = g\,W^\top \;\; (B,\text{in}), \qquad dW = x^\top g \;\; (\text{in},\text{out}), \qquad db = \sum_{b=1}^{B} g_{b,:} \;\; (\text{out},)

The bias gradient sums across the batch because the same bias is added to every row in the forward pass.

Input

  • grad_outnp.ndarray of shape (B, out_features): the gradient of the loss w.r.t. the output yy.
  • xnp.ndarray of shape (B, in_features): the input that was fed to the forward pass.
  • Wnp.ndarray of shape (in_features, out_features): the weight matrix from the forward pass.

Output

Returns a tuple (dx, dW, db) with shapes (B, in), (in, out), and (out,) respectively.

Examples

Example 1 — the three gradients, hand-checked (B=1B=1)

Input:  x = [[1, 2]], W = [[1, 0, 1],
                           [0, 1, 1]], grad_out = [[1, 1, 1]]
Output: dx = [[2, 2]]
        dW = [[1, 1, 1],
              [2, 2, 2]]
        db = [1, 1, 1]

Explanation: dx=gW=[1,1,1]W=[2,2]dx = g\,W^\top = [1,1,1]\cdot W^\top = [2, 2]; dW=xg=[[1],[2]][1,1,1]=[[1,1,1],[2,2,2]]dW = x^\top g = [[1],[2]]\cdot[1,1,1] = [[1,1,1],[2,2,2]]; dbdb sums gg over the (single) batch row.

Example 2 — the bias gradient sums over the batch

Input:  grad_out = [[1, 2],
                    [3, 4]]   (B=2, out=2)
Output: db = [1+3, 2+4] = [4, 6]

Explanation: each output coordinate's bias contributed to both batch rows, so its gradient is the column sum of grad_out.

Constraints

  • dx = grad_out @ W.T(B, in); dW = x.T @ grad_out(in, out); db = grad_out.sum(axis=0)(out,).
  • The bias gradient is summed over axis=0 (the batch), not averaged.
  • Shapes must match exactly: grad_out is (B, out), W is (in, out) — only grad_out @ W.T yields (B, in).
  • The diagnostic finite-difference cross-check is on dW (the trickiest of the three), compared with atol≈1e-3.

Notes

  • Shapes force the algebra. With grad_out (B,out), W (in,out), x (B,in), there is exactly one dimension-consistent product for each gradient — a reliable sanity check when you forget which operand to transpose.
  • Series. This is the backward for the Linear forward layer; together with the ReLU/sigmoid and MSE backwards, it completes the pieces needed to train the XOR MLP.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Returns three arrays with correct shapes
  • dx matches grad_out @ W.T
  • dW matches x.T @ grad_out
  • db sums grad_out over the batch dimension
  • dW matches finite-difference numerical gradient
  • Larger batch shape: gradients still correct