Linear Classification

Data for this post:

Here is the sample data that I will be using throughout this post.

input_x:

1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
input_x = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])

input_x has 5 training examples (rows), each with 3 dimensions (columns). The dimensions are the feature values that describe each training example.

weights:

1 2 3
4 5 6
7 8 9
weights = np.array([[1,2,3],[4,5,6],[7,8,9]])

weights has shape number_of_dimensions x number_of_classes, thus we have 3 rows, one for each dimension, and then 3 columns, one for each class.

bias:

3 5 2
bias = np.array([3,5,2])

bias has dimensionality 3x1, which is number_of_classes x 1.

labels:

0 0 1 1 2
labels = np.array([0,0,1,1,2])

labels has dimensionality 5x1, which is number_of_training_examples x 1.

labels[i] = some_label means that training example input_x[i] belongs to class some_label.

We have 3 classes: 0 = first, 1 = second, 2 = third.

Quick overview of linear classification:

  • Predict which class each input belongs to by scoring it against every class and choosing the class with the highest score. We will use something called a score function for this.
  • Determine how good or bad our scores are by comparing the incorrect class scores with the correct class score. This is done with a loss function.
  • Find the best way to update our weights from the results of the loss function by using gradient descent (taking the partial derivative of the loss function with respect to the weight rows of the correct class and of the incorrect classes).
  • Update the weights by scaling the gradient with a small constant and subtracting it from our weights.

Score function:

To calculate the scores, we will use the following linear function:

f(x, W, b) = Wx + b

This is the same as:

f(input_x, weights, bias) = (weights)(input_x) + bias

  • input_x has shape 5 x 3: the 5 is the number of training examples, and the 3 is the number of data values (dimensions) associated with each training example.
  • weights has dimensionality 3 x 3.
  • bias is a 3x1 vector that influences the scores.

After the scores are returned, for each training example, we will choose the class with the highest score to be the class of that training example.

A trick we can do to simplify our calculations is to combine the weights and bias into a single matrix.

We first change the dimensionality of input_x from 5 x 3 to 5 x 4 by adding an extra column filled with the value 1.

This makes our new input_x:

1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 5 5 1
input_x = np.array([[1,1,1,1],[2,2,2,1],[3,3,3,1],[4,4,4,1],[5,5,5,1]])

Then, we change the dimensionality of W from 3 x 3 to 4 x 3 by adding the bias vector as the extra row.

Making our weights matrix:

1 2 3
4 5 6
7 8 9
3 5 2
weights = np.array([[1,2,3],[4,5,6],[7,8,9],[3,5,2]])
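
If you would rather build these combined matrices programmatically than type them out, one way to do it (a small sketch starting from the original 5 x 3 input_x, 3 x 3 weights, and the bias vector) is:

input_x = np.hstack([input_x, np.ones((input_x.shape[0], 1))])  # append a column of 1s to every example
weights = np.vstack([weights, bias])                            # append the bias vector as an extra row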

We can then calculate the scores by doing:

scores = np.dot(weights.T, input_x.T)

The scores will be:

15 27 39 51 63
20 35 50 65 80
20 38 56 74 92
np.array([[15,27,39,51,63],[20,35,50,65,80],[20,38,56,74,92]])

Quick breakdown:

  • np.dot is used for matrix multiplication, and we need to transpose the matrices to get the proper dimensions: weights.T has shape 3x4 and input_x.T has shape 4x5, giving us scores with shape 3x5 (one row per class, one column per training example), which is what we want.
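
With the scores computed, we could already read off predictions by taking, for each column (training example), the row (class) with the highest score. A quick check on our example (note that np.argmax picks the first index on a tie, as in the first column where classes 1 and 2 both score 20):

predictions = np.argmax(scores, axis=0)  # array([1, 2, 2, 2, 2])

Compared with labels = [0, 0, 1, 1, 2], only the last training example is classified correctly, so these weights clearly need work.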

Now that we understand how to calculate the scores, we need to determine if the weights that we chose are doing a good job of classifying our data.

Loss function:

We use something called a loss function to determine how good our weights are: if the weights are doing a good job of classifying our data, the loss will be low.

Intuitively, we want the score of the correct class to be the highest. If the correct class score is lower than any other score, there should be loss. Also, to push the scores towards the proper class, we actually want there to be loss whenever the correct class score is not higher than an incorrect class score by a certain margin. The margin is just a constant we choose; for this example we’ll use a margin of 1.
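
Written as a formula (this is the standard way this loss is expressed; s_j is the score of class j for training example i, y_i is the correct class of that example, and delta is the margin we just chose):

L_i = sum over all classes j ≠ y_i of max(0, s_j - s_yi + delta)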

Therefore as part of our loss function, we want something like:

correct_scores = np.ones(scores.shape) * scores[labels, np.arange(0, scores.shape[1])]

Quick breakdown of code:

  • np.arange(0, scores.shape[1]) will return an array [0, 1, 2, 3, 4]
  • labels is [0, 0, 1, 1, 2]
  • [labels, np.arange(0, scores.shape[1])] will go to locations (0,0), (0,1), (1,2), (1,3), (2,4)
  • scores[labels, np.arange(0, scores.shape[1])] will return an array [15, 27, 50, 65, 92]
  • Note the use of *: this does element-wise multiplication (with broadcasting), which fills every row of the ones matrix with the correct class scores

Thus correct_scores is:

15 27 50 65 92
15 27 50 65 92
15 27 50 65 92
np.array([[15,27,50,65,92],[15,27,50,65,92],[15,27,50,65,92]])

We have a 3 x 5 matrix, where each column consists of the correct class score for that training example.

For our margin, we will create a ones matrix of the same shape as scores:

delta = np.ones(scores.shape)

Then we do:

loss = scores - correct_scores + delta

This will give us a matrix:

1 1 -10 -13 -28
6 9 1 1 -11
6 12 7 10 1
np.array([[1,1,-10,-13,-28],[6,9,1,1,-11],[6,12,7,10,1]])

We don’t want to add any negative values to the overall loss, because a negative entry means the correct class score is already greater than that incorrect class score by more than the margin we want.

We can do this by:

loss[loss < 0] = 0

Making loss:

1 1 0 0 0
6 9 1 1 0
6 12 7 10 1
np.array([[1,1,0,0,0],[6,9,1,1,0],[6,12,7,10,1]])
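
As a side note, the same clamping can be done in a single step with np.maximum; this is just an equivalent alternative to the boolean mask above:

loss = np.maximum(scores - correct_scores + delta, 0)  # builds the clamped matrix in one step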

Also, we should only have loss from the incorrect classes; we don’t want any loss from the correct class itself (its entries currently equal delta, since scores - correct_scores is 0 there).

We do:

loss[labels, np.arange(0, scores.shape[1])] = 0

Giving us:

0 0 0 0 0
6 9 0 0 0
6 12 7 10 0
np.array([[0,0,0,0,0],[6,9,0,0,0],[6,12,7,10,0]])

Then we want to get the average of all the losses.

loss_sum = np.sum(loss)
loss_sum /= input_x.shape[0]
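
For our running example, a quick sanity check with the masked loss matrix above:

np.sum(loss)   # 6 + 9 + 6 + 12 + 7 + 10 = 50
loss_sum       # 50 / 5 = 10.0, so the averaged data loss is 10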

We also want to add regularization to our loss function to encourage small weights, which helps prevent overfitting. We’ll use the squared L2 norm here.

loss_sum += regularization_factor * np.sum(weights * weights)

For our example we’ll use .00001 as the regularization_factor.

The loss that we developed is known as the Support Vector Machine (SVM) loss.
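
Putting the pieces together, here is a rough sketch of the whole loss computation as a single function (svm_loss is just a name I am choosing here; it assumes the bias trick has already been applied, so weights is 4 x 3 and input_x is 5 x 4 as above):

def svm_loss(weights, input_x, labels, regularization_factor, delta=1.0):
    scores = np.dot(weights.T, input_x.T)                        # shape: num_classes x num_examples
    correct_scores = scores[labels, np.arange(scores.shape[1])]  # correct class score per example
    margins = scores - correct_scores + delta                    # broadcasting stands in for the np.ones trick
    margins[margins < 0] = 0                                     # no loss when the margin is satisfied
    margins[labels, np.arange(scores.shape[1])] = 0              # no loss from the correct class itself
    loss = np.sum(margins) / input_x.shape[0]                    # average over the training examples
    loss += regularization_factor * np.sum(weights * weights)    # L2 regularization
    return loss

With the example data above this returns the 10 we computed plus the small regularization term, roughly 10.003.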

Optimizing weights:

To help us determine how to change our weights, we’ll use the gradient. The gradient points in the direction of steepest ascent, so stepping against it gives us the direction of steepest descent.

The gradient of the loss function we previously defined can be calculated as follows:

For the row of weights that corresponds to the correct class of a training example, we get the gradient by counting the number of classes that contributed to the overall loss, i.e. the number of incorrect classes whose score was not below the correct class score by at least the set margin. We then multiply that count by -1 and by the input x to get the gradient with respect to the rows of the correct classes.

The gradient with respect to the row of an incorrect class is simply the indicator function (1 if that class violated the margin, 0 otherwise) multiplied by input_x.
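
Written out more explicitly (x_i is training example i, y_i is its correct class, and delta is the margin, as before), the per-example gradient contributions we are about to compute are:

gradient for the row of the correct class y_i:       -(number of classes j ≠ y_i with s_j - s_yi + delta > 0) * x_i
gradient for the row of an incorrect class j ≠ y_i:  (1 if s_j - s_yi + delta > 0, else 0) * x_i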

To calculate the gradient, we need to go back to our loss matrix.

loss = scores - correct_scores + delta

Which is:

1 1 -10 -13 -28
6 9 1 1 -11
6 12 7 10 1
np.array([[1,1,-10,-13,-28],[6,9,1,1,-11],[6,12,7,10,1]])

Then, we use an indicator function to determine which classes met the margin and which didn’t:

loss[loss > 0] = 1 # these entries violated the margin and contributed to the loss, so they become 1
loss[loss < 0] = 0 # these entries satisfied the margin and do not contribute to the loss

This gives us:

1 1 0 0 0
1 1 1 1 0
1 1 1 1 1
np.array([[1,1,0,0,0],[1,1,1,1,0],[1,1,1,1,1]])

Then we set the values at indices of the correct classes to be 0.

loss[labels, np.arange(0, scores.shape[1])] = 0

Giving us:

0 0 0 0 0
1 1 0 0 0
1 1 1 1 0
np.array([[0,0,0,0,0],[1,1,0,0,0],[1,1,1,1,0]])

This sets all the indices of the correct classes to 0, since we want to count the number of classes that contributed to the loss, and the correct classes do not contribute to the loss.

The next step is to count, in each column, the number of incorrect classes that violated the margin, and store the negated count at the index of the correct class. Note that this is part of calculating the gradient with respect to the rows of the correct classes.

loss[labels, np.arange(0, scores.shape[1])] = -1 * np.sum(loss, axis=0)

Quick breakdown of code:

  • -1 * np.sum(loss, axis=0) returns an array [-2,-2,-1,-1,0] by summing down each column
  • By using loss[labels, np.arange(0, scores.shape[1])], we are able to put the above array in the locations (0,0),(0,1),(1,2),(1,3),(2,4).

Thus we get:

-2 -2 0 0 0
1 1 -1 -1 0
1 1 1 1 0
np.array([[-2,-2,0,0,0],[1,1,-1,-1,0],[1,1,1,1,0]])

Notice that we have 1’s and 0’s in the locations of the incorrect classes; the indicator function step we took earlier already gives us what we need for the gradient with respect to the rows of the incorrect classes.

Now we matrix multiply the loss and the input_x.

gradient = np.dot(loss, input_x)

We get the gradient matrix to be:

-6 -6 -6 -4
-4 -4 -4 0
10 10 10 4
np.array([[-6,-6,-6,-4],[-4,-4,-4,0],[10,10,10,4]])

Then we want to normalize the gradient by dividing by the number of examples, which is the same as input_x.shape[0]

gradient /= input_x.shape[0]

Giving us:

-1.2 -1.2 -1.2 -0.8
-0.8 -0.8 -0.8 0
2 2 2 0.8
np.array([[-1.2,-1.2,-1.2,-0.8],[-0.8,-0.8,-0.8,0],[2,2,2,0.8]])

We now need to transpose our gradient so it has the same shape as the weights matrix (4 x 3) and can be combined with it.

gradient = gradient.T

Giving us:

-1.2 -0.8 2
-1.2 -0.8 2
-1.2 -0.8 2
-0.8 0 0.8
np.array([[-1.2,-0.8,2],[-1.2,-0.8,2],[-1.2,-0.8,2],[-0.8,0,0.8]])

Then we need to add 2 * regularization_factor * weights to our gradient because of the L2 norm in the original loss.

gradient += 2*regularization_factor*weights

Updating weights:

To update our weights from the gradient we simply do:

weights += -1 * learning_rate * gradient

The learning rate is usually a small constant, around 0.001. It controls the “step size” we take: with a larger learning rate we take larger steps and can find good weights faster, but we can also easily step over the optimal weights. Therefore, we usually use 0.001 as a starting point and try different learning rates from there.
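
Putting everything together, a rough sketch of the full training loop might look like the following (svm_gradient is a hypothetical helper that packages the gradient steps walked through above; it is not defined earlier in this post):

learning_rate = 0.001
regularization_factor = 0.00001

for step in range(200):
    # loss is only needed here to monitor progress, using the svm_loss sketch from earlier
    loss_value = svm_loss(weights, input_x, labels, regularization_factor)
    # gradient of the SVM loss with respect to the weights, computed as shown above
    gradient = svm_gradient(weights, input_x, labels, regularization_factor)
    # same update as weights += -1 * learning_rate * gradient, written so it also
    # works when weights starts out as an integer array
    weights = weights - learning_rate * gradient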

Written on May 20, 2017