Max pooling 2D forwardEasy

Max pooling 2D forward

Background

Max pooling is the simplest downsampling operation in a CNN. For each channel independently, it slides a k × k window with step stride and keeps the maximum value inside. Max (rather than, say, the mean) survived the deep-learning era for two reasons: translation invariance — a small input shift that keeps the strongest activation in the same window gives the same output, useful for "is there an edge anywhere in this region?" features — and sparse gradients — on the backward pass only the argmax position receives gradient, so pooling routes rather than smooths. The classic config is k = 2, stride = 2, which halves HH and WW.

Problem statement

Implement max_pool_2d(x, kernel_size, stride): for each channel cc and output position (i,j)(i, j), take the maximum over the k × k window:

out[c,i,j]=max0u<k0v<k  x[c,  is+u,  js+v]\text{out}[c, i, j] = \max_{\substack{0 \le u < k \\ 0 \le v < k}} \; x\big[c,\; i\,s + u,\; j\,s + v\big]

with output spatial size (no padding, no dilation):

outH=Hks+1,outW=Wks+1\text{out}_H = \left\lfloor \frac{H - k}{s} \right\rfloor + 1, \qquad \text{out}_W = \left\lfloor \frac{W - k}{s} \right\rfloor + 1

Channels are pooled independently — the output keeps the same CC, only HH and WW shrink.

Input

  • xnp.ndarray of shape (C, H, W): the input feature map.
  • kernel_sizeint: the spatial size kk of the pooling window.
  • strideint: the step ss between windows.

Output

Returns an np.ndarray of shape (C, out_H, out_W) using the floor formula above.

Examples

Example 1 — classic 2×22\times2 pooling

Input:  x = [[[ 1,  2,  3,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12],
              [13, 14, 15, 16]]]        # shape (1, 4, 4)
        kernel_size=2, stride=2
Output: [[[ 6,  8],
          [14, 16]]]                    # shape (1, 2, 2)

Explanation: the input splits into four non-overlapping 2×22\times2 blocks and each output cell is that block's max — top-left max{1,2,5,6}=6\max\{1,2,5,6\}=6, top-right max{3,4,7,8}=8\max\{3,4,7,8\}=8, and so on.

Example 2 — stride larger than the kernel skips data

Input:  x = np.arange(25).reshape(1, 5, 5), kernel_size=2, stride=3
Output: shape (1, 2, 2);  out[0,0,0] = 6,  out[0,1,1] = 24

Explanation: outH=(52)/3+1=2\text{out}_H = \lfloor (5-2)/3 \rfloor + 1 = 2. With stride 3>3 > kernel 22 the windows do not overlap and skip row/column index 2 entirely. The top-left window covers {0,1,5,6}6\{0,1,5,6\}\to 6; the bottom-right window covers {18,19,23,24}24\{18,19,23,24\}\to 24.

Constraints

  • Pool per channel — never take a max across the channel axis; CC is unchanged.
  • Output sizing uses floor division: (Hk)/s+1\lfloor (H-k)/s \rfloor + 1 (no padding, no dilation).
  • The window for output (i, j) starts at (i*stride, j*stride).
  • kernel_size == stride == 1 is the identity (returns x unchanged); a constant input yields that same constant everywhere.

Notes

  • Routing, not smoothing. Because only the argmax in each window matters, pooling preserves the strongest activation and discards the rest — this is also why its backward pass sends gradient to a single position per window.
  • Series. build-cnn-04 pairs this with average pooling; both remain useful in modern CNNs even alongside global pooling and learned (strided-conv) downsampling.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape: (C, out_H, out_W) per the standard formula
  • Diagnostic: 4x4 input with k=2 stride=2 produces correct 2x2 maxes
  • Pooling is per-channel — no mixing across the C axis
  • Stride > kernel_size: non-overlapping windows skip data correctly
  • Constant input: output is the same constant everywhere
  • kernel=stride=1 is the identity