Nearest Neighbor Classifier

Preface:

The dataset I will refer to in order to help understand this classifier is the CIFAR-10 dataset.

The CIFAR-10 dataset has 50000 training images and 10000 test images. There are 10 labels, each with 5000 images in the training dataset.

The data is arranged into different batches, each with dimension 10000 x 3072. The images are 32x32x3 since there are 32x32 pixels, and RGB channels.

I like to think of each 1x3072 entry as an image.

Associated with each batch is an array of labels with dimensionality 10000 x 1: index i of the array corresponds to the i-th image in the batch, and the value at index i is the label of that image.

There is also a label_names array that maps the numeric value from the label array to the label's name.

We will combine all the training batches to get an array of shape 50000 x 3072 for the data and 50000 x 1 for the labels; label_names is a 10 x 1 array.
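A minimal sketch of this combining step, using np.concatenate. The batches below are synthetic random stand-ins with the right shapes, not real CIFAR-10 data (loading the actual files is a separate step):

```python
import numpy as np

# Five synthetic stand-ins for the CIFAR-10 training batches
# (each 10000 x 3072); real batches come from unpickling the files.
batches = [np.random.randint(0, 256, size=(10000, 3072), dtype=np.uint8)
           for _ in range(5)]
batch_labels = [np.random.randint(0, 10, size=10000) for _ in range(5)]

# Stack the five batches on top of each other.
training_dataset = np.concatenate(batches, axis=0)  # shape (50000, 3072)
training_labels = np.concatenate(batch_labels)      # shape (50000,)
```

Note that NumPy stores the labels as a flat array of shape (50000,), which is what the document's "50000 x 1" refers to.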

Now that we understand our dataset a little better, we can begin talking about the classifier!

Nearest Neighbor Classifier:

The nearest neighbor classifier is a super simple way for us to understand the classification problem.

Basically, we save ALL the training data into our array of 50000 x 3072, where the i-th entry of 3072 values is an image from the training dataset.

Then we go to our test dataset, which has dimensions 10000 x 3072, where the i-th entry of 3072 values is an image from the test dataset.

We take each image from our test dataset, and compare it to all 50000 images in our training dataset.

To compare images, we simply take the euclidean distance between each entry in the training dataset and the current i-th entry in the test dataset. This can easily be done in Python with:

euclidean_distances = np.sqrt(np.sum(np.square(training_dataset - test_dataset[i, :]), axis=1))
Quick breakdown of this code:
  • axis=1 means to perform the sum along the rows, which makes sense because we’re comparing 3072 values with 3072 values this way.
  • The [i,:] means the i-th row, which also makes sense because we want to compare using the i-th row of test_dataset.
  • The np.square returns an array of dimension 50000 x 3072, where each row contains the squared differences between the 3072 values of a training image and the 3072 values of the i-th test image.
  • The np.sum will sum up each row for us, giving us an array with dimensionality 50000 x 1.
  • Then we take the square root of all the entries using np.sqrt, giving the euclidean_distances array.
  • The final dimensionality will be 50000 x 1, where the j-th entry is the euclidean distance between the i-th test image and the j-th training image.
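To see the line in action, here is a toy-sized version with 4 "training images" of 3 values each and one "test image" (made-up numbers, not CIFAR-10 data):

```python
import numpy as np

# 4 tiny "training images", 3 values each.
training_dataset = np.array([[0., 0., 0.],
                             [1., 1., 1.],
                             [3., 4., 0.],
                             [5., 5., 5.]])
# One tiny "test image".
test_dataset = np.array([[0., 0., 0.]])

i = 0
euclidean_distances = np.sqrt(np.sum(np.square(training_dataset - test_dataset[i, :]), axis=1))
# distances: 0, sqrt(3) ≈ 1.73, 5, sqrt(75) ≈ 8.66
```

One entry per training image, exactly as described above.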

After getting the euclidean distances, the training image at the index with the smallest distance is what the classifier will use to predict the label of the test image.

We find the index of the minimum distance in the euclidean_distances array, use that index in our 50000 x 1 labels array to get the numeric value of the predicted label, and then use that numeric value to look up the name of the label in our label_names array.

This is the 1 nearest neighbor classifier.
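Putting the pieces together, a minimal 1-nearest-neighbor prediction for a single test image might look like this (tiny made-up data; label_names here is a stand-in list, not the real CIFAR-10 names):

```python
import numpy as np

# Tiny made-up training set: 3 images of 2 values each.
training_dataset = np.array([[0., 0.], [10., 10.], [0., 1.]])
training_labels = np.array([0, 1, 0])
label_names = ['cat', 'dog']  # stand-in names

test_image = np.array([9., 9.])
distances = np.sqrt(np.sum(np.square(training_dataset - test_image), axis=1))
nearest = np.argmin(distances)                    # index of the closest training image
predicted = label_names[training_labels[nearest]] # numeric label -> name
# predicted is 'dog', since [10., 10.] is closest to [9., 9.]
```

np.argmin gives the index of the minimum distance, which is then chained through the labels array and the label_names array exactly as described above.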

The k-Nearest Neighbor Classifier is where we take the k smallest distances from the euclidean_distances array, and use the labels of those associated training images to determine the label of the test image. To determine it, we use a simple majority: whichever label occurs most among the k neighbors becomes the predicted label.
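A sketch of the k-NN version, again on toy data (the distance line is unchanged; only the selection step differs, using np.argsort for the k smallest and np.bincount for the vote):

```python
import numpy as np

# Toy data: 4 training images of 2 values each, with labels 0 and 1.
training_dataset = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.]])
training_labels = np.array([0, 0, 1, 1])
test_image = np.array([0.5, 0.5])
k = 3

distances = np.sqrt(np.sum(np.square(training_dataset - test_image), axis=1))
k_nearest = np.argsort(distances)[:k]            # indices of the k closest images
votes = np.bincount(training_labels[k_nearest])  # count each label among them
predicted_label = np.argmax(votes)               # simple majority wins
# predicted_label is 0: the 3 nearest neighbors have labels 0, 0, 1
```

With k = 1 this reduces to the 1-nearest-neighbor classifier above.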

Note that the nearest neighbor classifier is a relatively dumb classifier and is primarily used to get an understanding of classification.

Written on May 19, 2017