
CHAPTER 3. A NEURAL NETWORK FROM SCRATCH
computed using the input of that layer, which was stored during the forward pass. We continue in
this way until we call the backward function of the first layer. This is why our implementations of
the layers' forward functions must save all of the values needed by the corresponding backward
functions. After the backward pass, each layer stores the gradients of its own parameters, ready
to be updated.
The update params function We also need a simple function that updates the parameters of the
network with a given learning rate. For this function we do not need to worry about passing data
between layers; we can simply call the update params function of each layer in the network.
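The delegation described above can be sketched as follows; the layer's gradient attributes (`dW`, `db`) and the module-level helper are hypothetical names chosen for illustration:

```python
import numpy as np

class Dense:
    """Minimal layer holding parameters and their gradients (hypothetical)."""
    def __init__(self, n_in, n_out):
        self.W = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)
        # In a real run these would be filled in by the backward pass.
        self.dW = np.ones((n_in, n_out))
        self.db = np.ones(n_out)

    def update_params(self, lr):
        # One gradient-descent step on this layer's own parameters.
        self.W -= lr * self.dW
        self.b -= lr * self.db

def update_params(layers, lr):
    """The network-level version simply delegates to each layer."""
    for layer in layers:
        layer.update_params(lr)
```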
The fit function Finally we can implement the fit method to enable our model to be trained.
This method will be similar to the fit function that we implemented for the logistic regression
classifier in Chapter 2 that used gradient descent.
In Chapter 2, we computed the gradient on the entire input data and then updated the parameters.
For the Sequential class, we will assume that our dataset is very large (> 1000 samples). In this
case it may be impossible to compute the gradients over the whole dataset at once as we did in
Chapter 2, because of memory limitations. Instead, we will split the data into smaller batches of
images, often called mini-batches. This method of batch-wise training is called mini-batch
stochastic gradient descent. Besides saving memory, it also has some theoretical advantages over
training on the entire dataset at once.
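Splitting the data into mini-batches can be sketched like this; the `make_batches` helper is a hypothetical name, not part of our Sequential class:

```python
import numpy as np

def make_batches(X, y, batch_size):
    """Split a (pre-shuffled) dataset into consecutive mini-batches.

    The final batch may be smaller when the dataset size is not a
    multiple of batch_size.
    """
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]
```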
For a particular mini-batch we need to perform several operations for training. First, we use
the forward function to do the forward pass of the network for that batch. After this pass, the
layers should have stored all of the values needed by the backward computation. Next we call the
backward function to compute the gradients of all of the layers. Finally, we call the
update params function to update the parameters of the network, based on this particular batch.
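The three steps above can be sketched as a complete mini-batch SGD loop. The `TinySoftmaxModel` below is a hypothetical one-layer stand-in for our Sequential class, and the `fit` signature is an illustrative assumption:

```python
import numpy as np

class TinySoftmaxModel:
    """A one-layer softmax classifier standing in for Sequential (hypothetical)."""
    def __init__(self, n_in, n_classes):
        rng = np.random.default_rng(0)
        self.W = rng.normal(0.0, 0.1, (n_in, n_classes))

    def forward(self, X):
        self.X = X                              # cache input for backward
        z = X @ self.W
        z -= z.max(axis=1, keepdims=True)       # numerical stability
        e = np.exp(z)
        self.probs = e / e.sum(axis=1, keepdims=True)
        return self.probs

    def backward(self, y):
        # Combined softmax + cross-entropy gradient w.r.t. the weights.
        grad = self.probs.copy()
        grad[np.arange(len(y)), y] -= 1.0
        self.dW = self.X.T @ grad / len(y)

    def update_params(self, lr):
        self.W -= lr * self.dW

def fit(model, X, y, batch_size, lr, epochs):
    """Mini-batch SGD: forward, backward, update for every batch."""
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            model.forward(xb)          # 1. forward pass, caching values
            model.backward(yb)         # 2. backward pass, storing gradients
            model.update_params(lr)    # 3. apply the stored gradients
```

The real fit method follows the same three-step loop; only the model it drives is the full stack of layers.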
It is important that each mini-batch is a decent sampling of the actual dataset. If a mini-batch
consists of samples from only a single class, the network will not learn the differences between
classes. The easiest way to prevent this is to shuffle the data before splitting it into mini-batches.
In our case, we assume that the data is already shuffled before the fit function is called.
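A minimal joint shuffle, assuming NumPy arrays and a hypothetical `shuffle_data` helper, could look like:

```python
import numpy as np

def shuffle_data(X, y, seed=0):
    """Shuffle samples and labels with the same random permutation."""
    perm = np.random.default_rng(seed).permutation(len(X))
    return X[perm], y[perm]
```

Using one permutation for both arrays keeps every sample paired with its own label.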
The predict function Most of the work in the predict function is done by the previously
mentioned forward function. The forward function gives the estimated probability of each class for
each input data sample. We want our Sequential class to be a classifier, so it should return class
labels instead of probabilities for each input data sample. In Chapter 2, we obtained labels from the
probabilities by using a threshold of 0.5. In this case, there are several output classes, so we want to
return the class with the highest predicted probability. We assume that the predict function is
called on a small batch of data, so we do not need to worry about splitting the data into batches
for this function.
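A sketch of this argmax step, assuming the forward function returns an array of per-class probabilities (the standalone `predict` signature is illustrative):

```python
import numpy as np

def predict(forward, X):
    """Convert per-class probabilities into predicted class labels."""
    probs = forward(X)              # shape: (n_samples, n_classes)
    return probs.argmax(axis=1)     # index of the most probable class
```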
3.4.2 Training the model with CIFAR100 data
Now, assuming all of the tests are passing, we can attempt to train our model on a subset of the
CIFAR100 dataset [19]. This dataset consists of 60,000 32 × 32 pixel color images for 100 classes.
Although the images are of a very small size, humans can still recognize the class of the image
(Figure 3.3). The CIFAR datasets are often used in machine learning research because the small
image size makes training fast, while the number of images is still large enough to fit large models.
In the hands-on exercises of this chapter, we will train and test on a 5-class subset of the
CIFAR100 dataset. The restriction to 5 classes makes the network faster to train, as well as makes
the problem a little easier for our network to learn. Each individual will train and test on a
different subset of the dataset. Test accuracies can therefore vary between subsets, since some
class combinations are more difficult to distinguish than others.