Dynamic-Tanh (DyT)

Difficulty: Medium

Background

Dynamic Tanh (DyT) is a drop-in replacement for LayerNorm in transformers, introduced in 2025 by Zhu et al. in "Transformers without Normalization". Instead of computing per-token statistics, it squashes activations with a tanh whose input is scaled by a learnable scalar, then applies a channel-wise affine transform. The authors report that it matches LayerNorm's stabilising effect without computing any normalization statistics.

Problem statement

Implement dynamic_tanh(x, alpha, gamma, beta):

$$\text{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta$$

Apply $\tanh$ to $\alpha x$ elementwise, then the channel-wise affine $\gamma, \beta$. Round to 4 decimals and return a (nested) list.

Input

  • x (np.ndarray): input tensor (e.g. shape (..., C)).
  • alpha (float): the scalar that multiplies x inside tanh.
  • gamma (np.ndarray, shape (C,)): per-channel scale.
  • beta (np.ndarray, shape (C,)): per-channel shift.

Output

Returns a (nested) list matching x's shape, each value rounded to 4 decimals.

Examples

Example 1

Input:  x = [[[0.14115588, 0.00372817, 0.24126647, 0.22183601]]], alpha = 0.5,
        gamma = [1, 1, 1, 1], beta = [0, 0, 0, 0]
Output: [[[0.0705, 0.0019, 0.1201, 0.1105]]]

Explanation: each element is scaled by $\alpha = 0.5$, passed through $\tanh$, then multiplied by $\gamma$ and shifted by $\beta$ (the identity here). For instance, $\tanh(0.5 \cdot 0.1412) = \tanh(0.0706) \approx 0.0705$.
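The arithmetic in Example 1 can be checked step by step with plain NumPy. This is only a sketch of the computation described above; the intermediate names `scaled`, `squashed`, `out`, and `result` are illustrative and not part of the problem.

```python
import numpy as np

x = np.array([[[0.14115588, 0.00372817, 0.24126647, 0.22183601]]])
alpha = 0.5
gamma = np.ones(4)   # identity scale
beta = np.zeros(4)   # identity shift

scaled = alpha * x              # scale every element by alpha
squashed = np.tanh(scaled)      # elementwise tanh
out = gamma * squashed + beta   # channel affine, broadcast over the last axis

result = np.round(out, 4).tolist()
print(result)  # [[[0.0705, 0.0019, 0.1201, 0.1105]]]
```

Because the inputs are small, tanh is nearly linear here, so each output is close to half of its input.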

Constraints

  • Compute $\tanh(\alpha x)$ elementwise, then broadcast-multiply by $\gamma$ and add $\beta$ over the last (channel) axis.
  • Round results to 4 decimals; return a nested list.
  • gamma/beta broadcast over the channel dimension.
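One way to satisfy these constraints is sketched below. It is not presented as the site's reference solution. It assumes the inputs may arrive as lists or arrays, which is why they are converted with `np.asarray`, and it relies on NumPy broadcasting to line `gamma` and `beta` up with the last axis of `x`.

```python
import numpy as np

def dynamic_tanh(x, alpha, gamma, beta):
    """Sketch of DyT: gamma * tanh(alpha * x) + beta, rounded to 4 decimals."""
    # Accept lists as well as arrays (an assumption about the caller).
    x = np.asarray(x, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # tanh is elementwise; gamma and beta of shape (C,) broadcast over the last axis.
    out = gamma * np.tanh(alpha * x) + beta

    # Round, then convert to a nested Python list with the same shape as x.
    return np.round(out, 4).tolist()
```

Calling `dynamic_tanh(x, 0.5, [1, 1, 1, 1], [0, 0, 0, 0])` with the `x` from Example 1 should reproduce the output listed there.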

Notes

  • LayerNorm computes a mean and variance over each token's channels; DyT computes no activation statistics at all. It uses only the learnable scalar $\alpha$ and the channel affine, so it needs no reduction over the activations at inference.
  • $\alpha$ controls how aggressively large activations are squashed; $\gamma, \beta$ restore representational range, just as LayerNorm's affine does.
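The two notes above can be illustrated side by side. The sketch below is an illustration, not code from the paper: `layer_norm` is a textbook formulation with an assumed `eps=1e-5`, `dyt` is the unrounded formula from this problem, and the sample values in `x` are made up. The LayerNorm version needs a mean and a variance over the channel axis, whereas the DyT version is purely elementwise.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Two reductions over the channel axis: mean and variance (eps is an assumed default).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dyt(x, alpha, gamma, beta):
    # No reductions: every output element depends only on its own input element.
    return gamma * np.tanh(alpha * x) + beta

x = np.array([[0.1, 1.0, 10.0, -10.0]])  # small, moderate, and large activations
gamma, beta = np.ones(4), np.zeros(4)

# Larger alpha pushes large activations toward the tanh limits of -1 and 1.
for alpha in (0.1, 0.5, 2.0):
    print("alpha =", alpha, np.round(dyt(x, alpha, gamma, beta), 4).tolist())

print("LayerNorm:", np.round(layer_norm(x, gamma, beta), 4).tolist())
```

With `gamma = 1` and `beta = 0`, the DyT outputs always lie within [-1, 1], while a small `alpha` keeps small inputs in the near-linear region of tanh.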

Tests

This problem ships 4 hidden tests:

  • Reference example
  • Identity affine equals tanh(alpha*x)
  • Channel affine is applied
  • alpha = 0 collapses output to beta
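The hidden tests themselves are not shown, so the checks below are only a guess at what the four names describe. They run against a hypothetical `dynamic_tanh` sketch, and the inputs other than the reference example are invented for illustration.

```python
import numpy as np

def dynamic_tanh(x, alpha, gamma, beta):
    # Hypothetical implementation used only to exercise the checks below.
    out = np.asarray(gamma, dtype=float) * np.tanh(alpha * np.asarray(x, dtype=float))
    return np.round(out + np.asarray(beta, dtype=float), 4).tolist()

x = np.array([[[0.14115588, 0.00372817, 0.24126647, 0.22183601]]])
ones, zeros = np.ones(4), np.zeros(4)

# 1. Reference example (values taken from Example 1).
ref = dynamic_tanh(x, 0.5, ones, zeros)
assert np.allclose(ref, [[[0.0705, 0.0019, 0.1201, 0.1105]]], atol=1e-9)

# 2. Identity affine equals tanh(alpha * x).
assert np.allclose(dynamic_tanh(x, 2.0, ones, zeros), np.round(np.tanh(2.0 * x), 4))

# 3. Channel affine is applied: each channel gets its own scale and shift.
affine = dynamic_tanh([[1.0, 1.0]], 1.0, [2.0, 3.0], [1.0, -1.0])
assert np.allclose(affine, [[2.5232, 1.2848]], atol=1e-6)

# 4. alpha = 0 collapses output to beta, because tanh(0) = 0.
beta = np.array([0.3, -0.2, 0.0, 1.5])
collapsed = dynamic_tanh(x, 0.0, 2.0 * ones, beta)
assert np.allclose(collapsed, np.broadcast_to(beta, x.shape))

print("all four checks passed")
```

The fourth check is a useful sanity test for any implementation: with `alpha = 0` the input no longer matters, and whatever `gamma` is, the output is `beta` broadcast to the shape of `x`.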