Max pooling 2D forward
Background
Max pooling is the simplest downsampling operation in a CNN. For each channel independently, it slides a k × k window with step stride and keeps the maximum value inside. Max (rather than, say, the mean) survived the deep-learning era for two reasons: translation invariance — a small input shift that keeps the strongest activation in the same window gives the same output, useful for "is there an edge anywhere in this region?" features — and sparse gradients — on the backward pass only the argmax position receives gradient, so pooling routes rather than smooths. The classic config is k = 2, stride = 2, which halves and .
Problem statement
Implement max_pool_2d(x, kernel_size, stride): for each channel and output position , take the maximum over the k × k window:
with output spatial size (no padding, no dilation):
Channels are pooled independently — the output keeps the same , only and shrink.
Input
x—np.ndarrayof shape(C, H, W): the input feature map.kernel_size—int: the spatial size of the pooling window.stride—int: the step between windows.
Output
Returns an np.ndarray of shape (C, out_H, out_W) using the floor formula above.
Examples
Example 1 — classic pooling
Input: x = [[[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]]] # shape (1, 4, 4)
kernel_size=2, stride=2
Output: [[[ 6, 8],
[14, 16]]] # shape (1, 2, 2)
Explanation: the input splits into four non-overlapping blocks and each output cell is that block's max — top-left , top-right , and so on.
Example 2 — stride larger than the kernel skips data
Input: x = np.arange(25).reshape(1, 5, 5), kernel_size=2, stride=3
Output: shape (1, 2, 2); out[0,0,0] = 6, out[0,1,1] = 24
Explanation: . With stride kernel the windows do not overlap and skip row/column index 2 entirely. The top-left window covers ; the bottom-right window covers .
Constraints
- Pool per channel — never take a max across the channel axis; is unchanged.
- Output sizing uses floor division: (no padding, no dilation).
- The window for output
(i, j)starts at(i*stride, j*stride). kernel_size == stride == 1is the identity (returnsxunchanged); a constant input yields that same constant everywhere.
Notes
- Routing, not smoothing. Because only the argmax in each window matters, pooling preserves the strongest activation and discards the rest — this is also why its backward pass sends gradient to a single position per window.
- Series.
build-cnn-04pairs this with average pooling; both remain useful in modern CNNs even alongside global pooling and learned (strided-conv) downsampling.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Output shape: (C, out_H, out_W) per the standard formula
- •Diagnostic: 4x4 input with k=2 stride=2 produces correct 2x2 maxes
- •Pooling is per-channel — no mixing across the C axis
- •Stride > kernel_size: non-overlapping windows skip data correctly
- •Constant input: output is the same constant everywhere
- •kernel=stride=1 is the identity