Conv2D forward (naive, stride 1, no pad)
Background
The 2D convolution is the first building block of every CNN and the most-asked computer-vision implementation question. A kernel slides over the spatial dimensions of a feature map, and at each position it computes a dot product between the kernel weights and the patch of input under it, summed over all input channels. Note the deep-learning convention: what frameworks call "convolution" is actually cross-correlation — the kernel is not flipped. This is the naive, three-loop starting point; later problems in the series add stride/padding and the backward pass.
Problem statement
Implement conv2d_naive(x, W): the 2D convolution forward pass for a single example, with stride 1 and no padding, using cross-correlation. For every output channel and spatial position :
The output's spatial size is , and each output cell is a sum over elements.
Input
x—np.ndarrayof shape(C_in, H, W): the input feature map ( channels, height , width ).W—np.ndarrayof shape(C_out, C_in, kH, kW): the kernel — one(C_in, kH, kW)filter per output channel.
Output
Returns an np.ndarray of shape (C_out, H - kH + 1, W - kW + 1), the output feature map.
Examples
Example 1 — single channel, a difference kernel
Input: x = [[[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]]] # shape (1, 3, 4)
W = [[[[1, 0],
[0, -1]]]] # shape (1, 1, 2, 2)
Output: [[[-5, -5, -5],
[-5, -5, -5]]] # shape (1, 2, 3)
Explanation: each output cell is x[i,j]·1 + x[i+1,j+1]·(-1) (the other two kernel weights are 0). For the top-left window that is ; every window of this input has a top-left/bottom-right difference of , so the whole output is .
Example 2 — summing across input channels
Input: x: shape (2, 3, 3) with channel 0 all 1s and channel 1 all 2s
W = np.ones((1, 2, 2, 2)) # one output channel, all-ones kernel
Output: shape (1, 2, 2), every cell = 12
Explanation: each output cell sums a patch over both channels: . This is the cross-channel reduction — the result is one value per output channel, not per input channel.
Constraints
- Cross-correlation, no kernel flip — do not reverse the kernel; flipping it fails the reference test.
- Stride is 1 and there is no padding, so the output spatial size is .
x's channel count andW's second axis must match (); each output cell reduces over all elements.- A identity kernel (
W[c, c, 0, 0] = 1) must reproduce the input exactly. - Tests compare against a reference with
atol=1e-9.
Notes
- Complexity. The naive triple loop is — correct but slow. Faster methods (im2col + matmul, FFT, the Toeplitz trick) reuse the same indexing.
- Series.
build-cnn-02adds stride and padding to this;build-cnn-05implements the backward pass.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Output shape is (C_out, H-kH+1, W-kW+1)
- •Matches reference on a 1-channel case
- •Matches reference on a multi-channel case (random input)
- •Identity kernel returns the centre region of the input (when kH=kW=1)
- •Sums across input channels (deterministic crisp test)