[Estimated time: 1 hour]
Explain how the training of a neural network would change if the chain rule were instead formulated as h(x) = f(g(x)) implies h'(x) = g'(f(x))f'(x), rather than the standard chain rule formula h'(x) = f'(g(x))g'(x). (Assume all functions are one-to-one correspondences from ℝ → ℝ, differentiable and continuous everywhere on ℝ.) How would this alternative formulation affect the backpropagation algorithm and the way gradients are computed?
To be more concrete, consider the following example:
f(x, y, z) = (x+y)z
Evaluated at x = -3, y = 9, z = -2
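For reference, here is a minimal sketch of the forward pass through this expression written as a two-node computational graph (the intermediate name q is our own):

```python
# Forward pass of f(x, y, z) = (x + y) * z at the given evaluation point.
x, y, z = -3.0, 9.0, -2.0

q = x + y   # intermediate node: q = 6.0
f = q * z   # output node:       f = -12.0

print(q, f)  # backprop (part A) walks this graph in reverse, reusing q
```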
A. Calculate the gradient of f with respect to x, y, and z at the specified evaluation point using the standard backpropagation algorithm (no modifications). Show your work. It is very helpful to draw out the computational graph in this case. (An optional numerical sanity check is sketched after part C.)
B. Use the modified chain rule to compute the gradient of f with respect to x, y, and z. Show your work.
C. Describe a high-level algorithm for computing gradients using this modified chain rule. (No need to list exact steps and variables; keep it high level.) Think about which intermediates we have to cache, and how we can compute those intermediates efficiently.
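Not required for your answer, but one way to sanity-check part A is a centered finite-difference estimate of each partial derivative. A minimal sketch (the helper name numeric_grad is our own):

```python
# Numerical gradient check for f(x, y, z) = (x + y) * z.
def f(x, y, z):
    return (x + y) * z

def numeric_grad(fn, point, eps=1e-6):
    # Centered finite differences: perturb one argument at a time.
    grads = []
    for i in range(len(point)):
        plus = list(point);  plus[i]  += eps
        minus = list(point); minus[i] -= eps
        grads.append((fn(*plus) - fn(*minus)) / (2 * eps))
    return grads

print(numeric_grad(f, [-3.0, 9.0, -2.0]))  # should match your part A result
```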
Hint 1:
These slides (starting from slide 42) may also be helpful for understanding backpropagation.
Note: You may upload a PDF with your work and share the link here if you find that easier. It's also fine to write your answer in the text box directly.
Rubric:
This is probably the hardest question here, so we know that not everyone will correctly answer all the sub-parts. We do expect most successful applicants to correctly answer A and B, though. Correctly answering C shows an exceptional applicant.