Gradient Descent

Introduction

The gradient is a vector of partial derivatives. In the classification setting used here, the partials are taken with respect to the weights of each class.

Partial Derivatives

To understand gradients we must first understand partials.

A partial derivative arises when a function has multiple variables: the derivative is taken with respect to one of the variables while all the others are treated as constants.

For example

f(x,y) = 2x + y
f_x : The partial with respect to x is 2
f_y : The partial with respect to y is 1

The gradient is simply a vector of all the partial derivatives of a function.

So the gradient of the given f(x,y) is [2, 1]
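The gradient above can be checked numerically. This sketch approximates each partial of f(x, y) = 2x + y with a central difference (the helper name numerical_gradient is my own):

```python
def f(x, y):
    return 2 * x + y

def numerical_gradient(func, x, y, h=1e-5):
    # Central differences approximate each partial derivative.
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [df_dx, df_dy]

grad = numerical_gradient(f, 3.0, 4.0)
print(grad)  # approximately [2.0, 1.0]
```

Because f is linear, the estimate matches the analytic gradient [2, 1] up to floating-point error at any point (x, y).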

Chain Rule

Next, the chain rule.

Given that f(x) and g(x) are differentiable, the derivative of f(g(x)) is g’(x) * f’(g(x))

For example

Given that f(x) = sin(10x^2)

f’(x) = 20x * cos(10x^2)

Applying to non-linearities

Now, let’s work through the derivatives of the non-linearities ReLU, Sigmoid, and Tanh.

ReLU: max(0,x)

For a single input x, we define the loss as: sum( max(0, weight_incorrect_class*x - weight_correct_class*x + 1) ) over each incorrect class.

By adopting the convention that the derivative at 0 is 0, the derivative of ReLU is an indicator function: if x <= 0, then f’(x) = 0, and if x > 0, then f’(x) = 1.
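The indicator behavior can be stated directly in code (function names are my own):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention: the derivative at 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

print(relu(-2.0), relu_grad(-2.0))  # 0.0 0.0
print(relu(3.0), relu_grad(3.0))    # 3.0 1.0
```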

Then, using the chain rule, we take the partial of weight_incorrect_class*x - weight_correct_class*x + 1 with respect to the weights of the correct class and with respect to the weights of the incorrect class.

We see that the partial with respect to the incorrect class weights is x, and the partial with respect to the correct class weights is -x.

Therefore for the correct class we get :

-1 * sum( indicator_fn(weight_incorrect_class*x - weight_correct_class*x + 1) * x ) over all incorrect classes

We first get the indicator function from the ReLU; then, with the chain rule, we take the derivative of the inner function, which is -x, because weight_incorrect_class*x can be treated as a constant.

The summation is due to the fact that there exists one function for each of the incorrect classes.

For incorrect classes, we get :

indicator_fn(weight_incorrect_class*x - weight_correct_class*x + 1) * x
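The loss and both gradient cases can be sketched for a single example. This is a minimal sketch under my own assumptions: the weights form a matrix W with one row per class, and the function name svm_loss_grad is hypothetical.

```python
import numpy as np

def svm_loss_grad(W, x, y, delta=1.0):
    """Hinge loss and its gradient for one example.

    W: (num_classes, num_features) weights, one row per class.
    x: (num_features,) input.  y: index of the correct class.
    """
    scores = W.dot(x)                         # one score per class
    margins = scores - scores[y] + delta      # w_incorrect*x - w_correct*x + 1
    margins[y] = 0.0                          # no margin term for the correct class
    indicators = (margins > 0).astype(float)  # derivative of max(0, .)
    loss = np.sum(np.maximum(0.0, margins))

    dW = np.outer(indicators, x)              # incorrect classes: indicator * x
    dW[y] = -np.sum(indicators) * x           # correct class: -sum(indicators) * x
    return loss, dW
```

For example, with W = [[1, 0], [0, 1], [1, 1]], x = [1, 2], and correct class y = 0, the scores are [1, 2, 3], both incorrect classes violate the margin, the loss is 5, and the correct class row of the gradient is -2x.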

Sigmoid: 1 / (1 + e^-x)

If we let s = sigmoid(x), then applying the chain rule gives a derivative of s * (1 - s). Conveniently, the derivative can be written entirely in terms of the forward-pass output.

Tanh: (e^x - e^-x) / (e^x + e^-x)

Similarly, the derivative of tanh is 1 - tanh(x)^2, again expressible in terms of the output.
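The derivatives of sigmoid and tanh have closed forms in terms of their outputs: sigmoid’(x) = sigmoid(x) * (1 - sigmoid(x)) and tanh’(x) = 1 - tanh(x)^2. A minimal sketch checking both against central differences (helper names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative in terms of the output s

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def numerical_derivative(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)

x = 0.5
print(sigmoid_grad(x), numerical_derivative(sigmoid, x))  # values agree closely
print(tanh_grad(x), numerical_derivative(math.tanh, x))   # values agree closely
```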

Written on June 17, 2017