GELU forward (tanh approximation)Easy

GELU forward (tanh approximation)

Background

GELU (Gaussian Error Linear Unit) is the smooth, probabilistic cousin of ReLU that GPT-2, BERT, and most transformer-era architectures adopted. The exact definition is xΦ(x)x\,\Phi(x), where Φ\Phi is the standard-normal CDF — but that is slow, so OpenAI introduced a tanh approximation (used by GPT-2, nanoGPT, and nn.GELU(approximate="tanh")) that matches the exact form within 10410^{-4} and is ~3× faster.

Problem statement

Implement gelu(x) using the tanh approximation, elementwise:

GELU(x)0.5x(1+tanh ⁣(2π(x+0.044715x3)))\text{GELU}(x) \approx 0.5\,x\,\Big(1 + \tanh\!\big(\sqrt{\tfrac{2}{\pi}}\,(x + 0.044715\,x^3)\big)\Big)

The constants are not analytically clean — 2/π0.7978\sqrt{2/\pi}\approx0.7978 and 0.0447150.044715 is exact from Hendrycks & Gimpel (2016); use them as given.

Input

  • xnp.ndarray of any shape.

Output

Returns an np.ndarray of the same shape.

Examples

Example 1 — a few hand-checked points

Input:  x = [-0.5, 0.0, 2.0]
Output: ≈ [-0.154, 0.0, 1.954]

Explanation: GELU(0)=0\text{GELU}(0) = 0 exactly (the tanh argument is 0). GELU(2.0)1.954\text{GELU}(2.0)\approx1.954 — already close to xx. GELU(0.5)0.154\text{GELU}(-0.5)\approx-0.154 — a small negative value, not zero.

Example 2 — asymptotes, and the negative leak

Input:  x = [1000, -1000]   -> ≈ [1000, 0]
        x = [-0.5]          -> ≈ -0.154        (ReLU would give 0)

Explanation: for large positive xx, tanh1\tanh\to 1 so GELU0.5x2=x\text{GELU}\to 0.5\,x\cdot 2 = x; for large negative xx, tanh1\tanh\to -1 so GELU0\text{GELU}\to 0. In between, moderately-negative inputs pass a little signal through — unlike ReLU, which hard-zeros every x0x\le 0.

Constraints

  • Translate the formula directly; it is a single elementwise expression over x.
  • GELU(0)=0\text{GELU}(0) = 0 exactly; for x0x \gg 0, GELU(x)x\text{GELU}(x)\to x; for x0x \ll 0, GELU(x)0\text{GELU}(x)\to 0.
  • A moderate negative like x=0.5x=-0.5 produces a small negative output (0.154\approx-0.154), not 0.
  • Use the constants as given — do not re-derive them.

Notes

  • Why GELU over ReLU. Its gradient is continuous at 0 (no kink), which behaves better with Adam, and it lets a small fraction of negative input through, avoiding ReLU's "dying unit" problem.
  • Pairs with the backward. See GELU backward for the derivative; modern refinements (SwiGLU, GeGLU) build on this, but GELU is still the default in most public transformers.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape matches input
  • GELU(0) = 0 (identity at the origin)
  • Diagnostic: matches the tanh approximation formula at hand-checked points
  • Large positive x: GELU(x) ~ x
  • Large negative x: GELU(x) ~ 0
  • GELU lets some negative input through (vs ReLU which would zero it)