Neural Network
Introduction
From the post on linear classification, we developed the score function s = W x.
For a neural network, we simply add an extra layer and a non-linearity function (we will use the ReLU, max(0, x)). We transform our score function from s = W x to s = W2 max(0, W1 x). In our example, let's assume that there are 5 classes we want to classify among. Our input x has dimensions 25x1, with 25 features. Our matrix W1 has dimensions 88x25, and creates an intermediate vector of 88x1. The non-linearity function is then applied to each element of that vector. Finally, W2 is of dimensionality 5x88, which gives us a score for each class. The point of the non-linearity is that the two matrices cannot be collapsed into a single matrix. For our weights W1 and W2, we calculate the gradients using backpropagation and train the weights using gradient descent.
We can keep repeating this process to add more layers.
A 5-layer neural network, for example: s = W5 max(0, W4 max(0, W3 max(0, W2 max(0, W1 x))))
Implementation
Code for a two-layer feed-forward neural network:
import numpy as np
f = lambda n: np.maximum(0, n) # ReLU non-linearity, applied element-wise
x = np.random.randn(25, 1) # input vector of dimension (25 x 1)
Wa, ba = np.random.randn(88, 25), np.zeros((88, 1)) # first-layer weights and biases, initialized here only for illustration
Wb, bb = np.random.randn(5, 88), np.zeros((5, 1)) # output-layer weights and biases
hidden_layer_one = f(np.dot(Wa, x) + ba) # first layer activations, Wa.shape = (88, 25), ba.shape = (88, 1), layer has dimension (88 x 1)
output = f(np.dot(Wb, hidden_layer_one) + bb) # output layer, Wb.shape = (5, 88), bb.shape = (5, 1), layer has dimension (5 x 1); the class with the max score is the prediction of the network
We can learn Wa, ba, Wb, and bb to get better predictions from the network.
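As a rough sketch of how that learning works (the squared-error loss, learning rate, toy data, and random initialization here are assumptions for illustration, not from the post), one gradient-descent step computes the gradients with backpropagation and nudges every parameter downhill:
import numpy as np

f = lambda n: np.maximum(0, n)                        # ReLU
Wa, ba = np.random.randn(88, 25), np.zeros((88, 1))   # assumed initialization
Wb, bb = np.random.randn(5, 88), np.zeros((5, 1))
x, y = np.random.randn(25, 1), np.random.randn(5, 1)  # toy input and target
lr = 1e-3                                             # assumed learning rate

# forward pass
z1 = np.dot(Wa, x) + ba
h = f(z1)                                             # hidden layer activations
scores = np.dot(Wb, h) + bb                           # no non-linearity on the output in this sketch
loss = 0.5 * np.sum((scores - y) ** 2)                # squared-error loss, assumed for illustration

# backward pass (backpropagation: chain rule, layer by layer)
dscores = scores - y                                  # dL/dscores
dWb, dbb = np.dot(dscores, h.T), dscores
dh = np.dot(Wb.T, dscores)
dz1 = dh * (z1 > 0)                                   # gradient flows only through active ReLU units
dWa, dba = np.dot(dz1, x.T), dz1

# gradient descent update
for param, grad in ((Wa, dWa), (ba, dba), (Wb, dWb), (bb, dbb)):
    param -= lr * grad                                # in-place update of each parameter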
Preprocessing Data
Center data to have mean zero by subtracting the mean in each dimension
# X is an [N x D] data matrix; N is the number of data points, D their dimensionality
X = np.random.randn(100, 25) # example data, for illustration only
X -= np.mean(X, axis = 0) # X is now zero-centered along each dimension
Normalize so that each dimension is scaled to [-1, 1], i.e. the min and max along each dimension become -1 and 1:
X = 2 * ((X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))) - 1 # per-dimension scaling to [-1, 1]
Initializing Weights
According to: https://arxiv.org/abs/1502.01852
We should draw the weights from a normal distribution with standard deviation sqrt(2/n), where n is the number of inputs to the neuron:
w = np.random.randn(n) * np.sqrt(2.0 / n) # He initialization; n is the neuron's fan-in
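For example, for the (88 x 25) first-layer matrix of the network above, each neuron has n = 25 inputs (the shape just reuses the earlier example):
Wa = np.random.randn(88, 25) * np.sqrt(2.0 / 25) # standard deviation sqrt(2/n) with n = 25 inputs per neuron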
Regularization
For each weight matrix W, add 0.5 * lambda * ||W||^2 (i.e., half the regularization strength times the sum of the squared entries of W) to the loss.
The regularization penalizes large weights, keeping them as small as possible to prevent overfitting.
Remember to account for the L2 regularization gradient when updating the weights:
W += -lam * W # lam is the regularization strength ('lambda' is a reserved word in Python)
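A minimal sketch of where the L2 term enters (lam, the learning rate, and the stand-in data gradient are assumptions for illustration):
import numpy as np

lam, learning_rate = 1e-3, 1e-2        # assumed regularization strength and step size
W = np.random.randn(5, 88)
dW_data = np.random.randn(5, 88)       # stand-in for the data-loss gradient from backpropagation

reg_loss = 0.5 * lam * np.sum(W * W)   # 0.5 * lambda * ||W||^2 added to the total loss
dW = dW_data + lam * W                 # the gradient of the regularization term is lambda * W
W += -learning_rate * dW               # combined gradient-descent update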
Dropout
# Perform dropout at each layer when computing the forward pass during training
p = 0.5 # probability of keeping a neuron active; lower p means more dropout
A = np.array([[1., 2., 3.], [4., 5., 6.]]) # example layer activations
B = np.random.rand(*A.shape) < p # * unpacks the shape tuple, e.g. np.random.rand(*(2, 3)) becomes np.random.rand(2, 3); B is a boolean mask, True where the draw is less than p
A *= B # dropout: zero out the masked values, making those neurons inactive
When predicting we don't apply dropout, but we must scale the activations by the p used during training. We scale by p because during training each neuron is kept with probability p, so its expected output is p·x + (1 − p)·0 = p·x; to see the same expected outputs at prediction time as in training, we scale the activations to p·x. Simply multiply the result of each forward pass by p.
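A sketch of the test-time forward pass with this scaling, reusing the names from the two-layer network above (the hidden layer is the one that had dropout applied during training, so that is the layer we scale):
# test-time forward pass with ordinary dropout: no masks, just scale by p
hidden_layer_one = f(np.dot(Wa, x) + ba) * p # scale the activations that were dropped out during training
output = f(np.dot(Wb, hidden_layer_one) + bb) # class scores, computed as usual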
Inverted dropout gives better performance during testing: scale during training instead!
To invert, simply do:
# Perform (inverted) dropout at each layer when computing the forward pass during training
p = 0.5 # probability of keeping a neuron active; lower p means more dropout
A = np.array([[1., 2., 3.], [4., 5., 6.]]) # example layer activations
B = (np.random.rand(*A.shape) < p) / p # boolean keep-mask divided by p: kept entries become 1/p, dropped entries 0
A *= B # dropout and scaling in one step; dropped neurons are zeroed, kept ones are scaled up by 1/p
Here we divide the mask B by p so that the scaling happens during training, and the test-time forward pass needs no change.
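With inverted dropout, prediction is then just the ordinary forward pass with no extra scaling (again reusing the names from the network above):
# test-time forward pass with inverted dropout: nothing special to do
hidden_layer_one = f(np.dot(Wa, x) + ba)
output = f(np.dot(Wb, hidden_layer_one) + bb)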
Batch normalization
https://arxiv.org/abs/1502.03167
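A rough sketch (not from the post) of the batch-normalization forward pass described in the paper above: normalize each feature over the mini-batch, then apply a learned scale (gamma) and shift (beta):
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # X is (N x D): a mini-batch of N examples with D features each
    mu = X.mean(axis=0)                    # per-feature mean over the batch
    var = X.var(axis=0)                    # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * X_hat + beta            # learned per-feature scale and shift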
Loss Functions
The full loss is the average of the per-example losses L_i over all training examples.
SVM Cost Function (good for classification): L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1), where the s are the class scores and y_i is the correct class.
L2 Squared Norm (good for predicting real-valued quantities): L_i = ||f - y_i||_2^2, the squared distance between the prediction f and the true value y_i.
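A sketch of both per-example losses (the names scores, y, and target are assumptions for illustration):
import numpy as np

def svm_loss_i(scores, y, delta=1.0):
    # multiclass SVM loss for one example: scores is the class-score vector, y the index of the correct class
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                         # the correct class contributes no margin
    return np.sum(margins)

def l2_loss_i(scores, target):
    # squared L2 loss for one example with a real-valued target vector
    return np.sum((scores - target) ** 2)
The full loss is then the mean of these per-example values over the training set.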
Training a neural network
Gradient Check:
Centered Difference Formula: df/dx ≈ (f(x + h) - f(x - h)) / (2h), with a small h such as 1e-5.
Relative Error: compare the analytic gradient f'_a (from backpropagation) to the numerical gradient f'_n with |f'_a - f'_n| / max(|f'_a|, |f'_n|); see the sketch after this list.
Double Precision: use double-precision floating point when gradient checking (see http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html).
Regularization Check: increase the regularization strength to confirm that it shows up in the gradient check and is not overwhelming the data loss.
Dropout: turn off dropout during the gradient check.
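A sketch of a gradient check for a single parameter, using the centered difference formula and the relative error above (f, x, and grad_analytic are assumed names for illustration):
def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b))

def check_gradient(f, x, grad_analytic, h=1e-5):
    grad_numeric = (f(x + h) - f(x - h)) / (2 * h)  # centered difference formula
    return relative_error(grad_analytic, grad_numeric)

# example: f(w) = 0.5 * w^2 has analytic gradient w
print(check_gradient(lambda w: 0.5 * w ** 2, 3.0, 3.0))  # should print a very small number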