Conv2D forward (naive, stride 1, no pad)Medium

Conv2D forward (naive, stride 1, no pad)

Background

The 2D convolution is the first building block of every CNN and the most-asked computer-vision implementation question. A kernel slides over the spatial dimensions of a feature map, and at each position it computes a dot product between the kernel weights and the patch of input under it, summed over all input channels. Note the deep-learning convention: what frameworks call "convolution" is actually cross-correlation — the kernel is not flipped. This is the naive, three-loop starting point; later problems in the series add stride/padding and the backward pass.

Problem statement

Implement conv2d_naive(x, W): the 2D convolution forward pass for a single example, with stride 1 and no padding, using cross-correlation. For every output channel oo and spatial position (i,j)(i, j):

out[o,i,j]=c=0Cin1  p=0kH1  q=0kW1x[c,i+p,j+q]    W[o,c,p,q]\text{out}[o, i, j] = \sum_{c=0}^{C_\text{in}-1}\; \sum_{p=0}^{k_H-1}\; \sum_{q=0}^{k_W-1} x[c,\, i+p,\, j+q]\; \cdot\; W[o,\, c,\, p,\, q]

The output's spatial size is (HkH+1)×(WkW+1)(H - k_H + 1) \times (W - k_W + 1), and each output cell is a sum over CinkHkWC_\text{in}\cdot k_H \cdot k_W elements.

Input

  • xnp.ndarray of shape (C_in, H, W): the input feature map (CinC_\text{in} channels, height HH, width WW).
  • Wnp.ndarray of shape (C_out, C_in, kH, kW): the kernel — one (C_in, kH, kW) filter per output channel.

Output

Returns an np.ndarray of shape (C_out, H - kH + 1, W - kW + 1), the output feature map.

Examples

Example 1 — single channel, a difference kernel

Input:  x = [[[ 1,  2,  3,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12]]]          # shape (1, 3, 4)
        W = [[[[1, 0],
               [0, -1]]]]                 # shape (1, 1, 2, 2)
Output: [[[-5, -5, -5],
          [-5, -5, -5]]]                  # shape (1, 2, 3)

Explanation: each output cell is x[i,j]·1 + x[i+1,j+1]·(-1) (the other two kernel weights are 0). For the top-left window that is 16=51 - 6 = -5; every 2×22\times2 window of this input has a top-left/bottom-right difference of 5-5, so the whole output is 5-5.

Example 2 — summing across input channels

Input:  x: shape (2, 3, 3) with channel 0 all 1s and channel 1 all 2s
        W = np.ones((1, 2, 2, 2))         # one output channel, all-ones kernel
Output: shape (1, 2, 2), every cell = 12

Explanation: each output cell sums a 2×22\times2 patch over both channels: 41ch 0+42ch 1=12\underbrace{4\cdot 1}_{\text{ch }0} + \underbrace{4\cdot 2}_{\text{ch }1} = 12. This is the cross-channel reduction — the result is one value per output channel, not per input channel.

Constraints

  • Cross-correlation, no kernel flip — do not reverse the kernel; flipping it fails the reference test.
  • Stride is 1 and there is no padding, so the output spatial size is (HkH+1)×(WkW+1)(H-k_H+1)\times(W-k_W+1).
  • x's channel count and W's second axis must match (CinC_\text{in}); each output cell reduces over all CinkHkWC_\text{in}\cdot k_H\cdot k_W elements.
  • A 1×11\times1 identity kernel (W[c, c, 0, 0] = 1) must reproduce the input exactly.
  • Tests compare against a reference with atol=1e-9.

Notes

  • Complexity. The naive triple loop is O(CoutHWCinkHkW)O(C_\text{out}\cdot H\cdot W\cdot C_\text{in}\cdot k_H\cdot k_W) — correct but slow. Faster methods (im2col + matmul, FFT, the Toeplitz trick) reuse the same indexing.
  • Series. build-cnn-02 adds stride and padding to this; build-cnn-05 implements the backward pass.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape is (C_out, H-kH+1, W-kW+1)
  • Matches reference on a 1-channel case
  • Matches reference on a multi-channel case (random input)
  • Identity kernel returns the centre region of the input (when kH=kW=1)
  • Sums across input channels (deterministic crisp test)