
is not None. Also, defining all the class variables in the initialization function is good for code
readability.
def __init__(self, k=3):
    self.x_train = None
    self.y_train = None
    self.k = k
You will need to implement the remaining predict(x) and fit(x,y) functions in the knn.py
file. The fit(x,y) function takes two arguments: the training data x and the training labels y.
We will be using the Scikit-learn convention for the input data format. The input variable x is a
N × D matrix where N is the number of data samples, and D is the number of feature dimensions.
In other words, the rows of x correspond to data samples and the columns correspond to features.
The variable y is a vector (or equivalently, a one-dimensional NumPy array) of size N. A typical
fit function would run some optimization procedure to find parameters of the model. For the KNN
case, the parameters of the model are the training data itself, so we only need to store the training
data in the fit(x,y) function.
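As a rough sketch (assuming the model is implemented as a class with the __init__ shown above; the class name KNN is our own placeholder), the fit(x,y) function only needs to store the data:

import numpy as np

class KNN:
    def __init__(self, k=3):
        self.x_train = None
        self.y_train = None
        self.k = k

    def fit(self, x, y):
        # KNN has no optimization step: the "parameters" are the
        # training data itself, so we simply store x and y.
        self.x_train = np.asarray(x)  # shape (N, D)
        self.y_train = np.asarray(y)  # shape (N,)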
Most of the implementation occurs in the predict(x) function. For a given testing data point,
we need to compute the Euclidean distance between the testing point and all of the training data
points. The k closest points according to these distances are the k nearest neighbors. We can use the
np.argsort function to get the indices of the training points sorted by distance. Using the nearest k
indices, we can look up the class labels of the k nearest neighbors from the stored labels in the
self.y_train variable. Finally, we take the most common of these k labels as the
class prediction. This can be computed using scipy.stats.mode. The input x in the predict(x)
function is of the same format as in the training function (N × D). This means that the predict(x)
function should handle inputs with multiple data samples, and should return a vector of size N of
the predicted class labels.
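A minimal sketch of predict(x) along these lines, written as a method of the class sketched earlier (note that the exact return format of scipy.stats.mode varies slightly between SciPy versions), might look like:

import numpy as np
from scipy import stats

def predict(self, x):
    x = np.asarray(x)  # shape (N, D), same convention as fit(x, y)
    predictions = np.zeros(x.shape[0], dtype=self.y_train.dtype)
    for i in range(x.shape[0]):
        # Euclidean distance from test point i to every training point.
        distances = np.sqrt(np.sum((self.x_train - x[i]) ** 2, axis=1))
        # Indices of the k nearest training points, sorted by distance.
        nearest = np.argsort(distances)[:self.k]
        # Most common label among the k nearest neighbors.
        mode_result = stats.mode(self.y_train[nearest])
        predictions[i] = np.ravel(mode_result.mode)[0]
    return predictions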
Extra Information: KNN Speed
The naïve implementation of the KNN model must loop through every data point in the
training set to compute the distances for a single test point. The computational complexity
of testing a single sample grows with the size of the dataset. This is clearly not desirable
as modern datasets may have thousands or millions of samples. However, we do not need
to check every sample in the training dataset every time we test a sample in KNN. There
are algorithms which attempt to partition the search space in order to make prediction more
efficient. The KD-tree is one such popular algorithm that can be used to speed up KNN [7].
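For instance, SciPy's scipy.spatial.cKDTree can replace the brute-force search; the arrays below are random placeholders used only to show the calls:

import numpy as np
from scipy.spatial import cKDTree

x_train = np.random.randn(1000, 2)            # placeholder training features
x_test = np.random.randn(5, 2)                # placeholder test features
tree = cKDTree(x_train)                       # build the tree once from the training data
distances, indices = tree.query(x_test, k=3)  # k nearest neighbors for each test point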
We have provided unit tests in test_knn.py. These tests use some synthetic data to make sure
the nearest-neighbors computation is correct. When running on a full dataset we do not use unit tests,
because it is difficult to know in advance what the output should be for a large dataset before training the model.
For this we have provided a small script, run_knn.py, that shows how to train and test on a dataset.
We will use a synthetic dataset (Figure 2.3) to test our model because it is easy to visualize the input
features. The synthetic dataset is implemented in datasets.py, which we can import to test
our code. The dataset consists of two classes that are randomly generated from two 2-D Gaussian
distributions. The classes slightly overlap, which makes the learning problem difficult because some
data points are ambiguous. In a real-world problem, the two dimensions could correspond to two
different features of the data. For example, the x-axis could be the height of an individual, and the
y-axis could represent that individual's peak running speed. The label could be whether or not that
person played basketball in high school. Although the input features should be good indicators of
the output label, there may not be a clear separation with these features. For example, there could
be some people who are tall and fast yet did not play basketball.
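A data-generating process of this kind could be sketched as follows (the means, standard deviations, and sample counts here are illustrative placeholders, not necessarily the values used in datasets.py):

import numpy as np

rng = np.random.default_rng(0)
n = 100                                                      # samples per class (placeholder)
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))  # first 2-D Gaussian
class1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n, 2))  # second, overlapping Gaussian
x = np.vstack([class0, class1])                              # (2n, D) feature matrix
y = np.concatenate([np.zeros(n), np.ones(n)])                # class labels: 0 or 1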