GELU backward (tanh approximation)
Background
GELU is the smooth activation that replaced ReLU in transformers (GPT, BERT). The forward (tanh approximation) is one line; the backward is a good product-rule + chain-rule workout — there are five separate places x appears in the derivative, so it is easy to drop a term. Getting it right (and checking it against a numerical gradient) is exactly the skill you need to debug a real backprop bug.
Problem statement
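Implement gelu_backward(grad_out, x), the derivative of GELU (tanh approximation) w.r.t. its input, times the upstream gradient. With

$$g(x) = \sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right), \qquad \mathrm{GELU}(x) = 0.5\,x\,\left(1 + \tanh(g(x))\right),$$

the derivative of $\mathrm{GELU}$ is

$$\frac{d\,\mathrm{GELU}}{dx} = 0.5\,\left(1 + \tanh(g(x))\right) + 0.5\,x\,\left(1 - \tanh^2(g(x))\right)\,\sqrt{2/\pi}\,\left(1 + 0.134145\,x^2\right).$$

Return grad_out * d_GELU/dx. (The constant $0.134145 = 3 \times 0.044715$ comes from differentiating the $0.044715\,x^3$ term.)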
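To see where each term comes from, the derivative can be worked out one rule at a time. This is a sketch that assumes the standard tanh approximation, $\mathrm{GELU}(x) = 0.5\,x\,(1 + \tanh(g(x)))$ with $g(x) = \sqrt{2/\pi}\,(x + 0.044715\,x^3)$; the shorthand $t = \tanh(g(x))$ is introduced here only for brevity.

```latex
\begin{aligned}
\frac{d}{dx}\,\mathrm{GELU}(x)
  &= \frac{d}{dx}\Big[\,0.5\,x\,(1 + t)\Big], \qquad t = \tanh(g(x)) \\
  &= 0.5\,(1 + t) + 0.5\,x\,\frac{dt}{dx}
     && \text{(product rule)} \\
  &= 0.5\,(1 + t) + 0.5\,x\,(1 - t^2)\,g'(x)
     && \text{(chain rule, } \tfrac{d}{du}\tanh u = 1 - \tanh^2 u\text{)} \\
g'(x) &= \sqrt{2/\pi}\,\big(1 + 3 \cdot 0.044715\,x^2\big)
       = \sqrt{2/\pi}\,\big(1 + 0.134145\,x^2\big)
\end{aligned}
```

The first term comes from differentiating the bare $x$ factor, the second from differentiating $\tanh(g(x))$. Dropping either one, or the factor $g'(x)$, is the usual way this derivative goes wrong.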
Input
- grad_out (np.ndarray): the upstream gradient w.r.t. the GELU output (same shape as x).
- x (np.ndarray): the input that was fed to GELU (not the output).
Output
Returns an np.ndarray of the same shape — the gradient w.r.t. x.
Examples
Example 1 — the slope at $x = 0$
Input: grad_out = [1.0], x = [0.0]
Output: [0.5]
Explanation: $g(0) = 0$, so $\tanh(g(0)) = 0$; the second term carries a factor of $x = 0$, so the derivative is $0.5\,(1 + 0) + 0 = 0.5$.
Example 2 — the asymptotic slopes
Input: x = [10, 100, 1000], grad_out = [1,1,1] -> ≈ [1, 1, 1]
x = [-10, -100], grad_out = [1,1] -> ≈ [0, 0]
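Explanation: as $x \to +\infty$, $\tanh(g(x)) \to 1$ and the second term decays, so the slope $\to 1$ (GELU $\approx$ identity for large positive $x$). As $x \to -\infty$, $\tanh(g(x)) \to -1$ and both terms collapse, so the slope $\to 0$ (GELU saturates to 0, like ReLU's left side but smooth).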
Constraints
- Use the product rule on $x \cdot (1 + \tanh(g(x)))$ plus the chain rule on $\tanh(g(x))$; $\frac{d}{du}\tanh(u) = 1 - \tanh^2(u)$.
- gelu_backward takes the input x, not the GELU output.
- Multiply the local derivative by grad_out (it scales the result elementwise).
- The analytic gradient must match a finite-difference numerical gradient within atol≈1e-3; this is the test that catches any sign or missing-term error.
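The finite-difference check mentioned above can be sketched as follows. This is a minimal illustration, not the hidden test itself: the helper names gelu_forward and numerical_grad, the central-difference scheme, and the step size eps=1e-5 are all assumptions made for this example.

```python
import numpy as np

def gelu_forward(x):
    """GELU, tanh approximation (assumed forward definition)."""
    g = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(g))

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for an elementwise function f."""
    x = np.asarray(x, dtype=float)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
num = numerical_grad(gelu_forward, x)

# Compare against your own analytic backward (not defined in this sketch):
#   np.allclose(gelu_backward(np.ones_like(x), x), num, atol=1e-3)
```

Because GELU acts elementwise, perturbing every entry at once gives the per-element slope directly; no loop over coordinates is needed. Passing grad_out of all ones makes the analytic result equal to the local derivative, which is what the numerical estimate approximates.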
Notes
- Compute tanh(g(x)) once. np.tanh is the expensive call, and you need it for both $1 + \tanh(g(x))$ and $1 - \tanh^2(g(x))$; reuse it rather than recomputing.
- Pairs with the forward. This is the backward for GELU forward; together they're the activation inside the MLP block of GPT- and BERT-style transformers.
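One way to apply the "compute tanh once" note is sketched below. It assumes the standard tanh-approximation constants (0.044715, and 0.134145 = 3 × 0.044715); the names SQRT_2_OVER_PI, t, dg, and local are illustrative choices, and casting x to float is an assumption made so that integer inputs behave sensibly.

```python
import numpy as np

SQRT_2_OVER_PI = np.sqrt(2.0 / np.pi)

def gelu_backward(grad_out, x):
    """Gradient of tanh-approximation GELU w.r.t. x, scaled by grad_out."""
    x = np.asarray(x, dtype=float)                 # assumption: work in floating point
    g = SQRT_2_OVER_PI * (x + 0.044715 * x**3)
    t = np.tanh(g)                                 # computed once, reused twice below
    dg = SQRT_2_OVER_PI * (1.0 + 0.134145 * x**2)  # g'(x)
    local = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * dg
    return np.asarray(grad_out) * local            # elementwise scaling by upstream grad

print(gelu_backward(np.array([1.0]), np.array([0.0])))  # [0.5]
```

Storing t costs one temporary array the size of x and saves a second np.tanh call. For large |x| the factor 1 - t**2 becomes exactly 0 in floating point, so the second term vanishes without overflow even though x and dg are large.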
This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Output shape matches input
- d_GELU/dx at x=0 is 0.5 (sanity)
- Diagnostic: matches finite-difference numerical gradient
- grad_out is broadcast (multiplied) into the result
- Large positive x: d_GELU/dx -> 1 (looks like x for large x, slope -> 1)
- Large negative x: d_GELU/dx -> 0 (saturates to 0)