Backpropagation
Gradients
We can use the chain rule to derive gradients; as mentioned in a previous post, a gradient is just a vector of partial derivatives.
The derivative tells us the rate of change of a function with respect to a certain variable at a particular point.
We can use this to understand the effect each variable has on the value of an expression.
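Formally, the derivative is the limit of the ratio between a tiny change in a variable and the resulting change in the function:

df/dx = lim (h -> 0) [ f(x + h) - f(x) ] / h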
Example 1:
f(x,y) = xy ; Given x = 5, y = -10.
f(x,y) = -50.
df/dx = -10
* This tells us that if we increase the value of x, the result of the expression will decrease by ten times that amount.
df/dy = 5
* This tells us that if we increase the value of y, the result of the expression will increase by five times that amount.
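As a quick sanity check, here is a minimal sketch in Python that compares these analytic gradients against centered finite differences (the step size h is just a small illustrative value, not something from the example):

```python
def f(x, y):
    return x * y

x, y, h = 5.0, -10.0, 1e-5

# Analytic gradients: df/dx = y, df/dy = x
print(y, (f(x + h, y) - f(x - h, y)) / (2 * h))  # -10.0 vs ~-10.0
print(x, (f(x, y + h) - f(x, y - h)) / (2 * h))  #   5.0 vs ~5.0
```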
Example 2:
Backpropagation is pretty much just repeated application of the chain rule, which is how we compute the gradient.
f(x,y,z) = (x + 2y)z
We can substitute q = (x + 2y) to get:
f(x,y,z) = qz
Then we can get:
df/dq = z
df/dz = q
dq/dx = 1
dq/dy = 2
However, the derivatives we actually care about are df/dx and df/dy, not df/dq.
We apply the chain rule to get the derivative of f with respect to x and y.
df/dx = (df/dq)(dq/dx) = z * 1
df/dy = (df/dq)(dq/dy) = z * 2
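Here is a minimal sketch of this forward/backward pass in Python (the input values are made up for illustration):

```python
# Forward pass through f(x, y, z) = (x + 2y) * z
x, y, z = 1.0, 2.0, -3.0   # illustrative values
q = x + 2 * y              # q = 5.0
f = q * z                  # f = -15.0

# Backward pass: apply the chain rule in reverse order
df_dq = z                  # df/dq = z
df_dz = q                  # df/dz = q
df_dx = df_dq * 1          # (df/dq)(dq/dx) = z * 1
df_dy = df_dq * 2          # (df/dq)(dq/dy) = z * 2

print(df_dx, df_dy, df_dz)  # -3.0 -6.0 5.0
```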
Example 3:
Consider the sigmoid function:
f(n) = 1 / (1 + e^-n), with n = np.array([w0, w1, w2]).dot(np.array([x0, x1, 1])), i.e. n = w0x0 + w1x1 + w2
df/dn = (e^-n) / (1 + e^-n)^2
= ((1 + (e^-n) - 1) / (1 + e^-n)) * (1 / (1 + e^-n))
= (1 - f(n))f(n)
df/dx = [ w0 * df/dn, w1 * df/dn ]
df/dw = [ x0 * df/dn, x1 * df/dn, 1 * df/dn ]
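Putting it together, here is a sketch of the full forward and backward pass for this sigmoid neuron in numpy (the weights and inputs are example values, not from the post):

```python
import numpy as np

w = np.array([2.0, -3.0, -3.0])   # example [w0, w1, w2]
x = np.array([-1.0, -2.0, 1.0])   # example [x0, x1, 1]; the trailing 1 folds in w2

# Forward pass
n = w.dot(x)                      # w0x0 + w1x1 + w2
f = 1.0 / (1.0 + np.exp(-n))      # sigmoid

# Backward pass
df_dn = (1 - f) * f               # local gradient of the sigmoid
df_dx = w[:2] * df_dn             # [w0 * df/dn, w1 * df/dn]
df_dw = x * df_dn                 # [x0 * df/dn, x1 * df/dn, 1 * df/dn]

print(df_dx)
print(df_dw)
```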
Using backpropagation, we can calculate the gradients needed to update our parameters.
Written on August 12, 2017