GELU forward (tanh approximation)
Background
GELU (Gaussian Error Linear Unit) is the smooth, probabilistic cousin of ReLU that GPT-2, BERT, and many transformer-era architectures adopted. The exact definition is $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard-normal CDF. Evaluating $\Phi$ requires the error function, which is comparatively slow, so Hendrycks & Gimpel (2016) also gave a tanh approximation (popularized by OpenAI's GPT-2 code, and used by nanoGPT and nn.GELU(approximate="tanh")) that tracks the exact form closely and is ~3× faster.
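To make "tracks the exact form closely" concrete, the sketch below compares the two forms numerically on a grid. The names gelu_exact and gelu_tanh are illustrative, not part of the exercise, and the exact form is written with math.erf, assuming the identity $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. It measures accuracy only and says nothing about speed.

```python
import math

import numpy as np

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def gelu_exact(x):
    """Exact GELU, x * Phi(x), with Phi expressed through the error function."""
    x = np.asarray(x, dtype=np.float64)
    erf = np.vectorize(math.erf)  # math.erf is scalar-only, so vectorize it
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU (illustrative helper)."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x**3)))


# Largest absolute gap between the two forms on a grid over [-6, 6].
xs = np.linspace(-6.0, 6.0, 2001)
max_abs_err = float(np.max(np.abs(gelu_tanh(xs) - gelu_exact(xs))))
print(f"max |tanh - exact| on [-6, 6]: {max_abs_err:.2e}")
```

The grid range and resolution are arbitrary choices; a finer grid or wider range can be substituted without changing the code.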
Problem statement
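Implement gelu(x) using the tanh approximation, elementwise:
$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$
The constants are not analytically clean: $\sqrt{2/\pi} \approx 0.7979$, and $0.044715$ is taken exactly as given in Hendrycks & Gimpel (2016); use them as given.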
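One possible shape for the solution is sketched below. It assumes the tanh-approximation formula $0.5\,x\,(1 + \tanh(\sqrt{2/\pi}\,(x + 0.044715\,x^{3})))$ and casts the input to float64, which is a choice made here rather than something the problem requires.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Elementwise GELU via the tanh approximation."""
    # Casting to float64 is an assumption of this sketch; it also lets
    # plain Python lists and integer arrays be passed in.
    x = np.asarray(x, dtype=np.float64)
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))


# Usage: the output has the same shape as the input.
y = gelu(np.array([[-1.0, 0.0], [1.0, 2.0]]))
print(y.shape)
```

Because every operation is a NumPy ufunc or a scalar multiply, the expression broadcasts over arrays of any shape without an explicit loop.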
Input
x: np.ndarray of any shape.
Output
Returns an np.ndarray of the same shape.
Examples
Example 1 — a few hand-checked points
Input: x = [-0.5, 0.0, 2.0]
Output: ≈ [-0.154, 0.0, 1.955]
Explanation: $\mathrm{GELU}(0) = 0$ exactly (the tanh argument is 0). $\mathrm{GELU}(2) \approx 1.955$, already close to 2. $\mathrm{GELU}(-0.5) \approx -0.154$, a small negative value, not zero.
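The hand check for Example 1 can be reproduced one scalar at a time. The sketch below uses only Python's math module and prints the intermediate tanh argument for each point; gelu_scalar is a hypothetical helper for checking values, not the array function the problem asks for.

```python
import math


def gelu_scalar(x: float) -> float:
    """Tanh-approximation GELU for a single float (illustrative helper)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-0.5, 0.0, 2.0):
    # Show the tanh argument alongside the result to make each step visible.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    print(f"x={x:+.1f}  tanh argument={inner:+.4f}  gelu={gelu_scalar(x):+.4f}")
```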
Example 2 — asymptotes, and the negative leak
Input: x = [1000, -1000] -> ≈ [1000, 0]
x = [-0.5] -> ≈ -0.154 (ReLU would give 0)
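Explanation: for large positive $x$, $\tanh(\cdot) \to 1$, so $\mathrm{GELU}(x) \to x$; for large negative $x$, $\tanh(\cdot) \to -1$, so $\mathrm{GELU}(x) \to 0$. In between, moderately negative inputs pass a little signal through, unlike ReLU, which hard-zeros every $x < 0$.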
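The asymptotes and the negative leak can be seen side by side with ReLU. This is a minimal sketch: relu is a helper added here for comparison only, and the float64 cast is an assumption of the sketch.

```python
import numpy as np


def gelu(x):
    """Tanh-approximation GELU, elementwise."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def relu(x):
    """Plain ReLU, included only as a reference point."""
    return np.maximum(np.asarray(x, dtype=np.float64), 0.0)


x = np.array([1000.0, -1000.0, -0.5])
g = gelu(x)
r = relu(x)
for xi, gi, ri in zip(x, g, r):
    print(f"x={xi:>8.1f}  gelu={gi:>10.4f}  relu={ri:>10.4f}")
```

At x = ±1000 the tanh saturates to ±1 in floating point, so the two functions agree there; they differ only at the moderately negative input.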
Constraints
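- Translate the formula directly; it is a single elementwise expression over x.
- $\mathrm{GELU}(0) = 0$ exactly; for $x \gg 0$, $\mathrm{GELU}(x) \approx x$; for $x \ll 0$, $\mathrm{GELU}(x) \approx 0$.
- A moderate negative like $x = -0.5$ produces a small negative output ($\approx -0.154$), not 0.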
- Use the constants as given — do not re-derive them.
Notes
- Why GELU over ReLU. GELU is smooth, so its gradient is continuous at 0 (no kink), which is often credited with better-behaved optimization under Adam. It also lets a small fraction of negative input through, which mitigates ReLU's "dying unit" problem.
- Pairs with the backward. See GELU backward for the derivative; modern refinements (SwiGLU, GeGLU) build on this, but GELU is still the default in most public transformers.
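The "no kink" claim in the first note can be checked numerically without deriving the backward pass. The sketch below estimates slopes just left and right of 0 with a central difference; the slope helper, the step size h, and the offset eps are all choices made here for illustration.

```python
import numpy as np


def gelu(x):
    """Tanh-approximation GELU, elementwise."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def relu(x):
    """Plain ReLU, for comparison."""
    return np.maximum(np.asarray(x, dtype=np.float64), 0.0)


def slope(f, x, h=1e-6):
    """Central-difference estimate of f'(x); an approximation, not the analytic derivative."""
    return float((f(x + h) - f(x - h)) / (2.0 * h))


eps = 1e-3  # how far from 0 the two slopes are sampled
for name, f in (("gelu", gelu), ("relu", relu)):
    left, right = slope(f, -eps), slope(f, eps)
    print(f"{name}: slope at -eps = {left:.4f}, slope at +eps = {right:.4f}")
```

For GELU the two estimates nearly coincide, while for ReLU they differ by about 1, which is the jump in its gradient at the origin.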
This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Output shape matches input
- GELU(0) = 0 (identity at the origin)
- Diagnostic: matches the tanh approximation formula at hand-checked points
- Large positive x: GELU(x) ~ x
- Large negative x: GELU(x) ~ 0
- GELU lets some negative input through (vs ReLU which would zero it)