Conv2D forward (padding + stride)Medium

Conv2D forward (padding + stride)

Background

This is the convolution real CNNs actually use: the cross-correlation of build-cnn-01 plus two extra knobs — zero-padding and stride. Padding adds a border of zeros so the kernel can sit over the edges (and, chosen right, keeps the spatial size unchanged); stride is how far the kernel hops each step, which downsamples the feature map. Together they are the bedrock of CNN architectures: stack padded stride-1 convs to learn features without shrinking, use stride-2 convs to halve resolution.

Problem statement

Implement conv2d(x, W, padding=0, stride=1): zero-pad the input by padding on each spatial side, then slide the kernel with step stride, computing cross-correlation (no kernel flip). Let x~\tilde{x} be x zero-padded by pp on each side of HH and WW. For each output channel oo and position (i,j)(i, j):

out[o,i,j]=c=0Cin1  u=0kH1  v=0kW1x~[c,  is+u,  js+v]    W[o,c,u,v]\text{out}[o, i, j] = \sum_{c=0}^{C_\text{in}-1}\;\sum_{u=0}^{k_H-1}\;\sum_{v=0}^{k_W-1} \tilde{x}\big[c,\; i\,s + u,\; j\,s + v\big]\; \cdot\; W[o, c, u, v]

with output spatial size (floor division, p=p = padding, s=s = stride):

outH=H+2pkHs+1,outW=W+2pkWs+1\text{out}_H = \left\lfloor \frac{H + 2p - k_H}{s} \right\rfloor + 1, \qquad \text{out}_W = \left\lfloor \frac{W + 2p - k_W}{s} \right\rfloor + 1

Input

  • xnp.ndarray of shape (C_in, H, W): the input feature map.
  • Wnp.ndarray of shape (C_out, C_in, kH, kW): the kernel.
  • paddingint: number of zero cells added on each side of HH and WW (the input becomes (C_in, H+2p, W+2p)).
  • strideint: the step size when sliding the kernel.

Output

Returns an np.ndarray of shape (C_out, out_H, out_W) using the floor-division formula above.

Examples

Example 1 — zero-padding, "same" output (k=3,p=1,s=1k=3, p=1, s=1)

Input:  x = np.ones((1, 3, 3)), W = np.ones((1, 1, 3, 3)), padding=1, stride=1
Output: shape (1, 3, 3);  out[0,0,0] = 4,  out[0,1,1] = 9

Explanation: padding 1 turns the 3×33\times3 input into a 5×55\times5 map with a zero border, and p=(k1)/2p=(k-1)/2 with stride 1 preserves the 3×33\times3 output ("same" padding). The top-left window straddles the corner, where only the inner 2×22\times2 block of ones survives → sum 44. The centre window covers a full 3×33\times3 of ones → sum 99. The corner being 44 (not 99) proves the padding is zeros, not edge-replication.

Example 2 — stride 2 downsamples

Input:  x shape (2, 8, 8), W shape (3, 2, 2, 2), padding=0, stride=2
Output: shape (3, 4, 4)

Explanation: outH=(8+02)/2+1=4\text{out}_H = \lfloor (8 + 0 - 2)/2 \rfloor + 1 = 4. A stride of 2 advances the window two cells at a time, roughly halving each spatial dimension; the cross-channel reduction is unchanged from build-cnn-01.

Constraints

  • Padding is zeros only — never edge-replicate or reflect (np.pad(x, ((0,0),(p,p),(p,p))) defaults to zeros).
  • Output sizing uses floor division: (H+2pkH)/s+1\lfloor (H + 2p - k_H)/s \rfloor + 1.
  • The window for output (i, j) starts at (i*stride, j*stride) in the padded input.
  • "Same" padding for an odd kernel is p=(k1)/2p = (k-1)/2 with stride 1 — it preserves H,WH, W.
  • Cross-correlation, no kernel flip (as in build-cnn-01). Tests compare against a reference with atol=1e-9.

Notes

  • Why padding. Without it, every conv shrinks the map by k1k-1; "same" padding lets you stack many conv layers and only downsample deliberately (via stride or pooling).
  • Series. Builds on build-cnn-01 (naive conv); build-cnn-03/04 add pooling, build-cnn-05 the backward pass, and build-cnn-06 composes a full tiny-classifier forward.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • With padding=0, stride=1: matches the no-pad reference
  • Diagnostic: padding=(k-1)//2, stride=1 preserves spatial size ('same' padding)
  • Stride 2 halves the spatial size (with appropriate padding)
  • Combined padding=1 and stride=2 matches reference
  • Padding actually pads with zeros (not edge-replicate, not reflect)