Dynamic Tanh (DyT)
Background
Dynamic Tanh (DyT) is a drop-in replacement for LayerNorm in transformers, proposed in 2025 by Zhu et al. in "Transformers without Normalization". Instead of computing per-token statistics, it squashes activations with $\tanh(\alpha x)$, where $\alpha$ is a learnable scalar, and then applies a per-channel affine transform. The paper reports that this reproduces LayerNorm's stabilising effect without computing any normalization statistics.
Problem statement
Implement dynamic_tanh(x, alpha, gamma, beta):
Apply $\tanh(\alpha x)$ elementwise, then the channel-wise affine $y = \gamma \odot \tanh(\alpha x) + \beta$. Round to 4 decimals and return a (nested) list.
Input
- x (np.ndarray): input tensor, e.g. of shape (..., C).
- alpha (float): the scalar inside tanh.
- gamma (np.ndarray of shape (C,)): per-channel scale.
- beta (np.ndarray of shape (C,)): per-channel shift.
Output
Returns a (nested) list matching x's shape, each value rounded to 4 decimals.
Examples
Example 1
Input: x = [[[0.14115588, 0.00372817, 0.24126647, 0.22183601]]], alpha = 0.5,
gamma = [1, 1, 1, 1], beta = [0, 0, 0, 0]
Output: [[[0.0705, 0.0019, 0.1201, 0.1105]]]
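Explanation: each element is scaled by $\alpha = 0.5$, passed through $\tanh$, then multiplied by $\gamma$ and shifted by $\beta$ (identity here). For instance, $\tanh(0.5 \times 0.14115588) = \tanh(0.07057794) \approx 0.0705$.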
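The arithmetic in Example 1 can be checked directly with NumPy. This is a small sketch rather than a reference solution: it assumes NumPy is available and uses the identity affine from the example, so only the tanh step changes the values.

```python
import numpy as np

# Example 1 input: one batch, one token, four channels.
x = np.array([[[0.14115588, 0.00372817, 0.24126647, 0.22183601]]])
alpha = 0.5
gamma = np.ones(4)   # identity scale
beta = np.zeros(4)   # identity shift

# Scale by alpha, squash with tanh, then apply the (identity) affine.
y = gamma * np.tanh(alpha * x) + beta
example_output = np.round(y, 4).tolist()
print(example_output)  # [[[0.0705, 0.0019, 0.1201, 0.1105]]]
```

Because every scaled input is small, tanh is close to linear here and each output is nearly alpha times the input.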
Constraints
- Compute $\tanh(\alpha x)$ elementwise, then broadcast-multiply by $\gamma$ and add $\beta$ over the last (channel) axis.
- Round results to 4 decimals; return a nested list.
- gamma/beta broadcast over the channel dimension.
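One way to satisfy these constraints is sketched below. It assumes NumPy and relies on its broadcasting rules to align the (C,) vectors with the last axis of x. The np.asarray conversions are an added convenience so that plain lists also work; the problem statement does not require them.

```python
import numpy as np

def dynamic_tanh(x, alpha, gamma, beta):
    """Sketch of DyT: gamma * tanh(alpha * x) + beta, rounded to 4 decimals."""
    x = np.asarray(x, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # Elementwise squash; alpha is a single scalar shared by all channels.
    squashed = np.tanh(alpha * x)

    # Shapes (..., C) and (C,) broadcast over the trailing channel axis.
    out = gamma * squashed + beta

    # Round, then convert to a nested Python list matching x's shape.
    return np.round(out, 4).tolist()
```

No reshaping is needed because NumPy aligns shapes from the trailing axis, so a (C,) vector pairs with the channel dimension of any (..., C) input.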
Notes
- Unlike LayerNorm, DyT computes no per-token mean or variance. Its only parameters are the learnable scalar $\alpha$ and the channel affine $\gamma$, $\beta$, so it avoids the reduction that normalization requires at inference.
- $\alpha$ controls how aggressively large activations are squashed; $\gamma$ and $\beta$ restore representational range, just as LayerNorm's affine does.
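The role of alpha can be seen by squashing the same activations with several values. The snippet below is an illustrative sketch: the activation values, the alpha grid, and the gamma and beta vectors are all made-up assumptions chosen for the demonstration, not values from the paper.

```python
import numpy as np

# Hypothetical activations, including two large outliers.
x = np.array([-8.0, -1.0, 0.0, 1.0, 8.0])

# Hypothetical alpha values, from "off" to strongly squashing.
for alpha in (0.0, 0.1, 1.0, 4.0):
    squashed = np.tanh(alpha * x)
    print(alpha, np.round(squashed, 4).tolist())

# With alpha = 0, tanh(0) = 0 everywhere, so the output is just beta.
beta = np.array([0.3, -0.2, 0.0, 0.1, 0.5])
gamma = np.array([2.0, 2.0, 2.0, 2.0, 2.0])
collapsed = gamma * np.tanh(0.0 * x) + beta
print(collapsed.tolist())  # [0.3, -0.2, 0.0, 0.1, 0.5]
```

A small alpha keeps tanh in its near-linear region, while a larger alpha pushes the outliers toward ±1. The affine parameters then rescale that bounded range per channel.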
This problem ships 4 hidden tests:
- Reference example
- Identity affine equals tanh(alpha*x)
- Channel affine is applied
- alpha = 0 collapses output to beta