Sorry folks, going to take a, perhaps lengthy, side-trip. When trying to sort the code for training the glasses or no-glasses conditional GAN, I got side-tracked by a lack of understanding. Which then led me on a much lengthier and more convoluted trip than I originally envisioned. But, I believe it is in fact a much needed voyage. So I am going to share it with you as best I can.

This is pretty much a copy of the post: Backpropagation — Chain Rule and PyTorch in Action by Robert Kwiatkowski. Though I have made an effort to write it in my own words. And made a small change in an attempt to confirm I was understanding what he was saying.

The thing that got me going was the call to .backward() on the loss value we were generating. That is, the Wasserstein distance with a gradient penalty. Previously I was using loss functions/classes provided by PyTorch. So I assumed they somehow provided the method to perform the back-propagation on the networks during training. And since we were calculating the loss value ourselves, I knew there was no backward() method in our loss calculation code. Turns out that backward() is a much more general method that can be applied to virtually any tensor (given some limitations/restrictions). That led me to investigate how back-propagation worked. Which included looking at the forward propagation process. And at gradients. And at computational graphs. And at…

The development of the next project, a Cycle GAN, is just going to have to wait a bit. Maybe a long bit. Not sure how many posts this is going to take. But, I hope you find it as worthwhile as I did.

Don’t currently have a clue how to go about taking you on this journey. But, perhaps we can start with the forward pass and computational graphs.

Propagation

The training of Neural Networks based on gradient-based optimization has two major steps:

  • forward propagation — calculate the network output(s) given its input(s)
  • backward propagation — calculate the gradients of the output(s) with respect to the input(s) to allow for use in a gradient descent optimizer (algorithm)

The first step is typically pretty straightforward. The concept behind the second step is also perhaps understandable. Gradients are required to determine the direction and size of the steps in the gradient descent optimization algorithm.
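Just so it is on the page (the post doesn’t strictly need it, but it shows where those gradients end up), a plain gradient descent step nudges each parameter \(\theta\) against its gradient of the loss \(L\), scaled by a learning rate \(\eta\):

$$ \theta_{new} = \theta_{old} - \eta\,\frac{\partial L}{\partial \theta} $$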

For most of us, understanding how these gradients are calculated can be difficult. Not least because you have to use some calculus. In particular, partial derivatives and the chain rule.

In PyTorch, propagation is built in. In the case of back propagation that includes automatic differentiation via computational graphs. We are mainly talking about backward propagation and PyTorch’s engine, Autograd. That’s what makes the tensors in PyTorch different from other arrays. Each tensor carries everything needed to generate its related gradients. Well, if PyTorch has been told to do so.
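A quick sketch of what “telling it to do so” looks like (my own toy example, not from the referenced post): create a tensor with requires_grad=True and anything computed from it remembers, via grad_fn, the operation that produced it.

import torch

a = torch.tensor(3.0, requires_grad=True)   # leaf tensor, PyTorch will track gradients for it
b = a * 2                                   # result records how it was produced

print(a.is_leaf, a.requires_grad)   # True True
print(b.grad_fn)                    # something like <MulBackward0 object at 0x...>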

Computational Graph

For better or for worse I am going to start with a relatively simple computational graph to hopefully show, in an understandable fashion, how forward and backward propagation works. Then I hope to move on to a simple network and do the same. I will see what I can get PyTorch to show us in both cases.

The computational graph below charts the equation \(z^2 * (x + y^2)^2\). Which is equivalent to \((z*(x + y^2))^2\).

So how do we get this into a computational graph?

  • input x and assign it to a variable, u
  • input y and define a variable, v, consisting of its square summed with u
  • input z and define a variable, w, consisting of it multiplied by v
  • calculate the final result, r, by squaring w

We need all those intermediate variables so that we can track the calculations involved. And then use them to generate the information necessary to produce the gradients we are after on the backward pass. Those intermediate variables are, clearly, function definitions; which can be used to generate partial derivatives. Those partial derivatives will be used to calculate the gradients for each input.

computational graph for equation above

For confirmation, let’s calculate the result of both versions: the equation and the graph.

The equation

$$ z^2(x + y^2)^2 \space\boldsymbol{\rightarrow}\space z^2(x^2 + 2xy^2 + y^4) \space\boldsymbol{\rightarrow}\space (z^2x^2 + 2z^2xy^2 + z^2y^4) $$

The graph

$$ x \space\boldsymbol{\rightarrow}\space x \space\boldsymbol{\rightarrow}\space x + y^2 \space\boldsymbol{\rightarrow}\space zx + zy^2 \space\boldsymbol{\rightarrow}\space (zx + zy^2)^2 \space=\space (z^2x^2 + 2z^2xy^2 + z^2y^4) $$

Forward Propagation

Let’s use \(x = 1,\space y = 2,\space z = 4\). And we get:

$$ u = x \space\boldsymbol{\rightarrow}\space u = 1 \scriptsize{\space\space\space\text{ and}}$$

$$v = u + y^2 \space\boldsymbol{\rightarrow}\space v = 1 + 2^2 \space\boldsymbol{\rightarrow}\space v = 5 \scriptsize{\space\space\space\text{ and}}$$

$$w = zv \space\boldsymbol{\rightarrow}\space w = 4 * 5 \space\boldsymbol{\rightarrow}\space w = 20 \scriptsize{\space\space\space\text{ and}}$$

$$r = w^2 \space\boldsymbol{\rightarrow}\space r = 20^2 \space\boldsymbol{\rightarrow}\space r = 400$$

Back Propagation

Time to look at calculating gradients. We will be using partial derivatives and the chain rule to eventually sort this out.

This step (process) has a few different labels over and above the one I used in the section heading. The one I liked best was reverse mode automatic differentiation. Automatic differentiation is behind both the name Autograd and how it works.

Sorry, more textual math. Hopefully with enough description to allow you to sort things out. But there is a drawing further down.

Partial Derivatives

r

The result is only a function of w. So, the partial derivative of r with respect to w is:

$$ \frac{\partial r}{\partial w} = \frac{\partial w^2}{\partial w} = 2w$$

w

w is a function of two variables z and v, so two partial derivatives to look at.

$$ \frac{\partial w}{\partial v} = \frac{\partial zv}{\partial v} = z$$

$$ \frac{\partial w}{\partial z} = \frac{\partial zv}{\partial z} = v$$

v

v is also a function of two variables, y and u.

$$ \frac{\partial v}{\partial u} = \frac{\partial (u + y^2)}{\partial u} = 1$$

$$ \frac{\partial v}{\partial y} = \frac{\partial (u + y^2)}{\partial y} = 2y$$

u

u is only a function of the input x.

$$ \frac{\partial u}{\partial x} = \frac{\partial x}{\partial x} = 1$$

computational graph with partial derivatives for equation above
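As a cross-check on those per-node partial derivatives, here is a small SymPy sketch (SymPy is my own addition, not something used in the referenced post). Giving each intermediate node its own symbol lets us differentiate one hop at a time:

# symbolically verify the per-node partial derivatives (illustrative only)
import sympy as sp

x, y, z, u, v, w = sp.symbols('x y z u v w')

print(sp.diff(w**2, w))      # dr/dw = 2*w
print(sp.diff(z*v, v))       # dw/dv = z
print(sp.diff(z*v, z))       # dw/dz = v
print(sp.diff(u + y**2, u))  # dv/du = 1
print(sp.diff(u + y**2, y))  # dv/dy = 2*y
print(sp.diff(x, x))         # du/dx = 1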

Gradients

Now for each of the inputs we can use the chain rule to get the appropriate derivative/gradient. I am doing them in reverse order to perhaps more clearly show how each chain builds up to get the desired result. You know, small steps.

$$ \frac{\partial r}{\partial z} = \frac{\partial r}{\partial w}\frac{\partial w}{\partial z} = 2wv = 2*20*5 = 200$$

$$ \frac{\partial r}{\partial y} = \frac{\partial r}{\partial w}\frac{\partial w}{\partial v}\frac{\partial v}{\partial y} = 2w*z*2y = 2*20*4*2*2 = 640$$

$$ \frac{\partial r}{\partial x} = \frac{\partial r}{\partial w}\frac{\partial w}{\partial v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial x} = 2w*z*1*1 = 2*20*4 = 160$$

These, in a machine learning context, would then be used for optimization with gradient descent (in our case we have so far been mostly using the Adam optimizer for that purpose).
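To make that concrete (and jumping ahead a bit to PyTorch), here is a little sketch of my own, not from the referenced post, using plain SGD rather than Adam so the arithmetic stays obvious: treat x, y and z as learnable parameters, treat r as the loss, and take a single optimizer step. Each value moves against its own gradient, scaled by the learning rate.

# one gradient descent step on the toy graph (illustration only)
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)
opt = torch.optim.SGD([x, y, z], lr=0.001)

r = (z * (x + y**2))**2   # forward pass; r plays the role of the loss
r.backward()              # x.grad=160, y.grad=640, z.grad=200 as calculated above
opt.step()                # each parameter: p = p - lr * p.grad

print(x.item(), y.item(), z.item())   # roughly 0.84, 1.36, 3.8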

PyTorch Implementation

Let’s see if we can get PyTorch to double check our results.

Autograd is a reverse automatic differentiation system. Conceptually, autograd records a graph recording all of the operations that created the data as you execute operations, giving you a directed acyclic graph whose leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

— from the PyTorch docs, Autograd mechanics: How autograd encodes the history

We need to make sure we tell PyTorch to track gradient information on our input variables. I.e. on each input tensor we must set requires_grad=True. If we did not, there would be nothing for backward() to operate on and no gradients would be calculated.

Otherwise, this is pretty simple code. And, surprisingly, I actually found it somewhat enlightening.

# backprop_1.py: use PyTorch to generate gradients for simple computational map used in blog post

import torch

# instantiate input tensors, and tell PyTorch to track gradients
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

# forward pass, i.e. calculate the result
u = x
v = u+y**2
w = z*v
r = w**2
print(f"r = {r}")

# backward pass, generate gradients, same order as in related post
r.backward()
print(f"dr/dz = {z.grad}")
print(f"dr/dy = {y.grad}")
print(f"dr/dx = {x.grad}")

And, the module output matches the values calculated manually above.

(mclp-3.12) PS F:\learn\mcl_pytorch\rek_1> python backprop_1.py
r = 400.0
dr/dz = 200.0
dr/dy = 640.0
dr/dx = 160.0

For fun I dropped the requires_grad=True from each tensor instantiation. Running the module then produced the following error, though the forward pass still generated the correct output.

(mclp-3.12) PS F:\learn\mcl_pytorch\rek_1> python backprop_1.py
r = 400.0
Traceback (most recent call last):
  File "F:\learn\mcl_pytorch\rek_1\backprop_1.py", line 21, in <module>
    r.backward()
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Done

Certainly a longer post than I anticipated. Well, longer in the sense that it took a fair bit of time to prepare and put together (images as well as content and symbolic math, using the MathJax engine and LaTeX).

That said, I quite enjoyed getting this sorted and documented (with help from many posts/tutorials). And I believe doing so has in some small way improved my wee bit of understanding regarding generating gradients. Doesn’t really help with my understanding of gradient descent, but small steps.

And I was certainly pleased when PyTorch confirmed my manual arithmetic.

Until next time, I do hope you find as much joy in your efforts.

Resources