Neural Network

Introduction

From the post on linear classification, we developed the score function:

s = Wx

For a neural network, we simply add an extra layer and a non-linearity function (we will use the ReLU, max(0, x)). We transform our score function from s = Wx to s = Wb max(0, Wa x). In our example, let’s assume that there are 5 classes that we want to classify among. Our input x has dimensions 25x1, with 25 features. Our matrix Wa has dimensions 88x25, and creates an intermediate vector of 88x1. The non-linearity function is then applied to each element of that vector. Finally, Wb is of dimensionality 5x88, which gives us a score for each class. The point of the non-linearity is that without it the two matrices could be collapsed into a single matrix, leaving us with a linear classifier again. For our weights Wa and Wb, we calculate the gradients using backpropagation, and train the weights using gradient descent.

We can continuously repeat this process to add more layers.

A 5-layer neural network, for example, just chains this pattern: each layer multiplies by a weight matrix and applies the non-linearity, and the final layer produces the class scores.
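A minimal sketch of such a deeper forward pass, with hypothetical hidden-layer sizes (the 64/32 sizes are illustrative choices, not from the original):

import numpy as np

f = lambda n: np.maximum(0, n)   # ReLU non-linearity

# layer sizes for a 5-layer network: 25 input features, four hidden layers, 5 class scores
sizes = [25, 64, 64, 32, 32, 5]
weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(n_out, 1) for n_out in sizes[1:]]

h = np.random.randn(25, 1)       # input vector
for i, (W, b) in enumerate(zip(weights, biases)):
    h = np.dot(W, h) + b
    if i < len(weights) - 1:     # no non-linearity on the final layer, which outputs raw scores
        h = f(h)
scores = h                       # (5 x 1) vector of class scores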

Implementation

Code for a two-layer feed-forward neural network:

import numpy as np

f = lambda n: np.maximum(0, n) # ReLU non-linearity, applied element-wise
x = np.random.randn(25, 1) # input vector of dimension (25 x 1)
Wa, ba = np.random.randn(88, 25), np.random.randn(88, 1) # first-layer weights and bias, Wa.shape = (88, 25), ba.shape = (88, 1)
Wb, bb = np.random.randn(5, 88), np.random.randn(5, 1) # output-layer weights and bias, Wb.shape = (5, 88), bb.shape = (5, 1)
hidden_layer_one = f(np.dot(Wa, x) + ba) # first layer activations, dimension (88 x 1)
output = np.dot(Wb, hidden_layer_one) + bb # class scores, dimension (5 x 1); the class with the max score is the prediction of the network (no non-linearity on the output layer)

We can learn Wa, ba, Wb, and bb to get better predictions from the network.

Preprocessing Data

Center data to have mean zero by subtracting the mean in each dimension

X is [N x D]; N is the number of data points, D is their dimensionality

X = np.array([[1., 2.], [3., 4.], [5., 6.]]) # toy data: N = 3 examples, D = 2 features
X -= np.mean(X, axis=0) # each column (dimension) now has zero mean

Normalize each dimension to the scale [-1, 1], so that along every dimension the min is -1 and the max is 1:

X = 2 * ((X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))) - 1

Initializing Weights

According to: https://arxiv.org/abs/1502.01852

We should draw the weights from a normal distribution with a standard deviation of sqrt(2/n), where n is the number of inputs to the neuron:

w = np.random.randn(n) * np.sqrt(2.0 / n)
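As a sketch, this initialization applied to the layers of the two-layer network above (initializing the biases to zero is a common default, not something from the original):

import numpy as np

Wa = np.random.randn(88, 25) * np.sqrt(2.0 / 25)  # fan-in n = 25 input features
Wb = np.random.randn(5, 88) * np.sqrt(2.0 / 88)   # fan-in n = 88 hidden units
ba = np.zeros((88, 1))                            # biases are commonly initialized to zero
bb = np.zeros((5, 1))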

Regularization

For each weight matrix W, add (1/2) * lambda * W^2 (half the regularization strength lambda times the sum of W's squared entries) to the loss.

The regularization term penalizes large weights, keeping them as small as possible to help prevent overfitting.

Remember to account for the L2 regularization when updating the weights; the gradient of (1/2) * lambda * W^2 is lambda * W, so each weight also gets the update:

W += -lam * W # lam is the regularization strength lambda ("lambda" itself is a reserved word in Python)
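A minimal sketch of how this folds into a full gradient descent step, where dWa is a stand-in for the data-loss gradient of Wa from backpropagation (the names step_size and dWa, and the values, are illustrative, not from the original):

import numpy as np

step_size = 1e-3                                  # learning rate (illustrative value)
lam = 1e-4                                        # regularization strength (illustrative value)

Wa = np.random.randn(88, 25) * np.sqrt(2.0 / 25)  # first-layer weights, initialized as above
dWa = np.random.randn(88, 25)                     # stand-in for the data-loss gradient from backpropagation

# gradient descent step: data gradient plus the L2 regularization gradient lam * Wa
Wa += -step_size * (dWa + lam * Wa)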

Dropout

# Perform dropout at each layer when calculating the forward pass
p = 0.5 # probability of keeping a unit active; lower p means more dropout
A = np.array([[1., 2., 3.], [4., 5., 6.]]) # activations of some layer
B = np.random.rand(*A.shape) < p # * unpacks the shape tuple, e.g. np.random.rand(*(2, 3)) becomes np.random.rand(2, 3); the result is a boolean mask that is True where the random value is less than p
A *= B # dropout by making masked values zero; making those neurons inactive

When predicting we won’t use dropout, but we must scale the activations by the p used in training. We scale by p at prediction time because during training each output is kept only with probability p, so its expected value is px + (1-p)0 = px; to keep the expected outputs the same in prediction as in training, we scale x to px. Simply multiply the result of each forward pass by p.
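A minimal sketch of that prediction-time forward pass, assuming f, Wa, ba, Wb, bb, x, and p from the code above:

# prediction-time forward pass: no dropout mask, but scale the hidden activations by p
hidden_layer_one = f(np.dot(Wa, x) + ba) * p  # matches the expected training-time output px
output = np.dot(Wb, hidden_layer_one) + bb    # class scores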

Inverted dropout: for better performance during testing, do the scaling during training instead!

To invert simply do:

# Perform inverted dropout at each layer when calculating the forward pass
p = 0.5 # probability of keeping a unit active; lower p means more dropout
A = np.array([[1., 2., 3.], [4., 5., 6.]]) # activations of some layer
B = (np.random.rand(*A.shape) < p) / p # same boolean keep-mask as before, divided by p so the expected activation stays unchanged
A *= B # dropout by making masked values zero; surviving activations are scaled up by 1/p

Here we divide the mask B by p, so the scaling happens during training and the prediction-time forward pass needs no extra scaling.
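With inverted dropout the prediction-time forward pass is just the plain network, with no change at all (again assuming the variables from the implementation above):

# prediction-time forward pass with inverted dropout: no mask and no scaling
hidden_layer_one = f(np.dot(Wa, x) + ba)
output = np.dot(Wb, hidden_layer_one) + bb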

Batch normalization

https://arxiv.org/abs/1502.03167

Loss Functions

The loss is the average of the losses for each example, L = (1/N) Σ_i L_i, plus any regularization term.

SVM Cost Function (good for classification): L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1), where s_j is the score for class j and y_i is the index of the correct class.

L2 Squared Norm (good for predicting real-valued quantities): L_i = ||f - y_i||_2^2, the squared distance between the prediction f and the true value y_i.
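A minimal sketch of the SVM cost for a single example, using a toy 5-class score vector like the one the network above produces (the scores, correct class index, and margin of 1 are illustrative):

import numpy as np

scores = np.array([2.5, -1.0, 3.0, 2.0, 1.0])  # toy score vector for one example (5 classes)
correct_class = 3                              # index of the true class y_i

margins = np.maximum(0, scores - scores[correct_class] + 1)  # hinge loss against every class
margins[correct_class] = 0                                   # the correct class contributes no loss
loss_i = np.sum(margins)                                     # 3.5 for these toy scores

# the full data loss is the average of loss_i over all N examples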

Training a neural network

Gradient Check:

Centered Difference Formula: df(x)/dx ≈ (f(x + h) - f(x - h)) / (2h), for a small h (e.g. 1e-5).

Relative Error: |f'_a - f'_n| / max(|f'_a|, |f'_n|), comparing the analytic gradient f'_a with the numerical gradient f'_n.
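A minimal sketch of such a gradient check on a toy function (the quadratic function and h = 1e-5 are illustrative choices):

import numpy as np

def f(x):
    return np.sum(x ** 2)                        # toy function; the analytic gradient is 2x

def numerical_grad(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fxph = f(x)                              # f(x + h)
        x.flat[i] = old - h
        fxmh = f(x)                              # f(x - h)
        x.flat[i] = old                          # restore the original value
        grad.flat[i] = (fxph - fxmh) / (2 * h)   # centered difference formula
    return grad

x = np.random.randn(5)
analytic = 2 * x
numerical = numerical_grad(f, x)
rel_error = np.abs(analytic - numerical) / np.maximum(np.abs(analytic), np.abs(numerical))
print(rel_error)                                 # should be tiny (e.g. < 1e-7) if the analytic gradient is right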

Double Precision: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Regularization Check: Increase the regularization strength to see whether the regularization term is overwhelming the data loss in the gradient check.

Dropout: Turn off dropout during the gradient check, since its randomness makes the loss non-deterministic.

TODO

Written on August 21, 2017