
Table 4.1: Data dimensions

Variable          Description                 1st dim               2nd dim               3rd dim  4th dim
x                 Input                       batch size            # of input channels   rows     columns
forward(x)        Output of forward pass      batch size            # of output channels  rows     columns
y_grad            Gradient at output          batch size            # of output channels  rows     columns
backward(y_grad)  Output of backward pass     batch size            # of input channels   rows     columns
W                 Weight tensor               # of output channels  # of input channels   rows     columns
W_grad            Gradient of weight tensor   # of output channels  # of input channels   rows     columns
b                 Bias vector                 1                     # of output channels  -        -
b_grad            Gradient of bias vector     1                     # of output channels  -        -
where x_i is a (1, n_i, n_r, n_c) shaped input from the previous layer, W_o is a single filter corresponding to the o-th output, b_o is a single bias corresponding to the output indexed by o, and ∗∗ is the convolution operator.
One thing to be careful about with convolution is the behavior at the edges of the input. What happens when the receptive field overlaps with the border? In valid convolution, the output locations whose receptive fields overlap with the border are simply dropped. In full convolution, we pad the input with some value (typically 0), so that we can compute an output corresponding to every input location. In our implementation we will use full convolution.
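As a small sketch of these two border behaviors (the 4 × 4 input and 3 × 3 filter are toy values assumed for illustration), we can compare a valid correlation with one where we zero pad the input ourselves first:

import numpy as np
from scipy.signal import correlate2d

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 single-channel input
w = np.ones((3, 3))                            # toy 3x3 filter

# Valid convolution: positions whose receptive field crosses the border are dropped.
valid = correlate2d(x, w, mode="valid")
print(valid.shape)   # (2, 2)

# Zero padding the input first gives an output for every input location.
x_padded = np.pad(x, 1)                        # one pixel of zeros on each side
padded = correlate2d(x_padded, w, mode="valid")
print(padded.shape)  # (4, 4)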
In signal processing, we are taught to flip the filter before sliding it across the input. Many machine learning libraries skip this flipping, because it makes no difference when training a convolutional neural network. In our implementation, we will do convolution without flipping the filter, to match common machine learning libraries (e.g. PyTorch). We can use scipy.signal.correlate to perform convolution with flipped filters (i.e. correlation). If scipy.signal.convolve is used instead, you need to flip the filters first to obtain the same result as scipy.signal.correlate.
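The equivalence is easy to check numerically; the arrays below are arbitrary toy data:

import numpy as np
from scipy.signal import correlate, convolve

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)

via_correlate = correlate(x, w, mode="valid")
# Flipping the filter along both axes turns signal-processing convolution into correlation.
via_convolve = convolve(x, np.flip(w), mode="valid")

print(np.allclose(via_correlate, via_convolve))  # True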
We will illustrate the desired behavior of the forward pass of the convolutional layer with a small example. The input to our layer will be of size (1, 2, 4, 4). This corresponds to a batch containing a single image with two channels and spatial dimensions of 4 × 4. We will apply 2 filters of size (2, 3, 3), so our weight tensor will be of size (2, 2, 3, 3). The output will be of size (1, 2, 4, 4). Figure 4.3 shows this convolution operation for a fake input without using a bias term.
We could implement the convolution ourselves with loops. It is easier, however, to use a convolution function from one of the included Python libraries. In this case we need to do convolution with flipped filters, also called correlation. We can use the scipy.signal.correlate function. Our inputs and outputs have four dimensions (batch, channel, rows, columns), but we only want to do proper convolution over two of the dimensions (rows and columns). If we use scipy.signal.correlate with "full" correlation naively, we will not get the expected result, because the function will perform zero padding in all dimensions. The "valid" setting does not perform zero padding, but we actually do want zero padding in the row and column dimensions. So we can either zero pad the data in the appropriate dimensions ourselves (using the np.pad function), or we can use loops to perform separate convolutions.
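A minimal sketch of this approach is given below. The function name conv_forward is an assumption for illustration, not the book's exact interface; the shapes follow Table 4.1, the bias has shape (1, # of output channels), and odd filter sizes are assumed so the output keeps the input's spatial size.

import numpy as np
from scipy.signal import correlate

def conv_forward(x, W, b):
    # x: (batch, in_ch, rows, cols), W: (out_ch, in_ch, kr, kc), b: (1, out_ch)
    batch, _, rows, cols = x.shape
    out_ch, _, kr, kc = W.shape
    # Zero pad only the row and column dimensions ourselves.
    x_pad = np.pad(x, ((0, 0), (0, 0), (kr // 2, kr // 2), (kc // 2, kc // 2)))
    out = np.zeros((batch, out_ch, rows, cols))
    for n in range(batch):
        for o in range(out_ch):
            # "valid" correlation over (channel, row, col); the channel axis
            # collapses to 1 because the filter spans all input channels.
            out[n, o] = correlate(x_pad[n], W[o], mode="valid")[0] + b[0, o]
    return out

x = np.random.randn(1, 2, 4, 4)   # the example shapes from the text
W = np.random.randn(2, 2, 3, 3)
b = np.zeros((1, 2))
print(conv_forward(x, W, b).shape)  # (1, 2, 4, 4)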