Backpropagation

Gradients

We can use the chain rule to derive gradients; as mentioned in a previous post, a gradient is just a vector of partial derivatives.

The derivative tells us the rate of change of a function with respect to a certain variable at a particular point.

We can use this to understand the effect each variable has on the value of an expression.
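To make this concrete, we can estimate a derivative numerically with a finite difference. A minimal sketch (the test function and step size h are arbitrary choices for illustration):

```python
# Approximate df/dx at a point with a finite difference.
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

# Example: f(x) = x^2, whose true derivative at x = 3 is 2*3 = 6.
print(numerical_derivative(lambda x: x ** 2, 3.0))  # ~6.0
```

This is also a handy way to sanity-check gradients we derive analytically, as in the examples below.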

Example 1

f(x,y) = xy ; Given x = 5, y = -10.

f(x,y) = -50.

df/dx = -10
* This tells us that if we increase the value of x, the result of the expression will decrease by ten times that amount

df/dy = 5
* This tells us that if we increase the value of y, the result of the expression will increase by five times that amount
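In code, this example amounts to a multiply gate: each input's partial derivative is simply the other input. A quick sketch that also verifies the values above numerically:

```python
def f(x, y):
    return x * y

x, y = 5.0, -10.0

# Analytic partials of f(x, y) = x * y: df/dx = y, df/dy = x.
dfdx, dfdy = y, x  # -10.0 and 5.0, matching the values above

# Numerical check with finite differences.
h = 1e-5
print((f(x + h, y) - f(x, y)) / h)  # ~ -10.0
print((f(x, y + h) - f(x, y)) / h)  # ~ 5.0
```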

Example 2

Backpropagation is pretty much just repeated application of the chain rule, which is how we compute the gradient.

f(x,y,z) = (x + 2y)z

We can substitute q = (x + 2y) to get:

f(x,y,z) = qz

From this we get the local derivatives:

df/dq = z
df/dz = q

dq/dx = 1
dq/dy = 2

However, what we actually care about is df/dx and df/dy, not df/dq.

We apply the chain rule to get the derivatives of f with respect to x and y.

df/dx = (df/dq)(dq/dx) = z * 1
df/dy = (df/dq)(dq/dy) = z * 2
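Putting this together as a forward pass followed by a backward pass (the input values here are arbitrary, just for illustration):

```python
# Forward pass for f(x, y, z) = (x + 2y)z.
x, y, z = -2.0, 5.0, -4.0
q = x + 2 * y          # q = 8.0
f = q * z              # f = -32.0

# Backward pass: chain rule from the output back to each input.
dfdq = z               # df/dq = z
dfdz = q               # df/dz = q
dfdx = dfdq * 1.0      # df/dx = (df/dq)(dq/dx) = z * 1
dfdy = dfdq * 2.0      # df/dy = (df/dq)(dq/dy) = z * 2

print(dfdx, dfdy, dfdz)  # -4.0 -8.0 8.0
```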

Example 3

Consider the sigmoid function:

f(n) = 1 / (1 + e^-n), with n = np.array([w0, w1, w2]).dot(np.array([x0, x1, 1])), i.e. n = w0*x0 + w1*x1 + w2
df/dn = (e^-n) / (1 + e^-n)^2
      = ((1 + e^-n - 1) / (1 + e^-n)) * (1 / (1 + e^-n))
      = (1 - f(n)) * f(n)

df/dx = [ w0 * df/dn, w1 * df/dn ]
df/dw = [ x0 * df/dn, x1 * df/dn, 1 * df/dn ]
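As a sketch of the full forward and backward pass for this sigmoid neuron (the weight and input values are arbitrary illustrations):

```python
import numpy as np

w = np.array([2.0, -3.0, -3.0])   # [w0, w1, w2]; w2 acts as the bias
x = np.array([-1.0, -2.0, 1.0])   # [x0, x1, 1]

# Forward pass.
n = w.dot(x)                      # w0*x0 + w1*x1 + w2
f = 1.0 / (1.0 + np.exp(-n))      # sigmoid(n)

# Backward pass, using df/dn = (1 - f(n)) * f(n).
dfdn = (1.0 - f) * f
dfdw = x * dfdn                   # [x0, x1, 1] * df/dn
dfdx = w[:2] * dfdn               # [w0, w1] * df/dn

print(f, dfdw, dfdx)
```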

Using backpropagation, we can calculate the gradients needed to update our parameters.
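For example, a single vanilla gradient descent step might look like this (the gradient values and learning rate are illustrative, not computed from a real loss):

```python
import numpy as np

w = np.array([2.0, -3.0, -3.0])       # current parameters
dfdw = np.array([-0.2, -0.39, 0.2])   # gradient from a backward pass (illustrative)
learning_rate = 0.1                   # step size; an arbitrary choice

w -= learning_rate * dfdw             # step each parameter in the negative gradient direction
print(w)
```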

Written on August 12, 2017