~~I am not sure if this will actually get completed or published.~~ I am planning to repeat the last post using a simple neural network. Probably one hidden layer, maybe two. Simple linear layer(s) with an activation function. I will likely use sigmoid for the activation function. Thought about using ReLU; but, apparently, it should not be used as the final output layer for classification or regression. My current thinking is an input, one hidden layer (with two nodes), an output and a loss function. Likely MSE (mean squared error) for the latter.

And, this isn’t about anything other than understanding how this back propagation thing works. And, maybe, if I get there, a little bit about optimization using gradient descent.

The Network

image of simple network being used for this post's discussion

The two nodes in the linear layer are labelled $u$ and $v$. The two Sigmoid nodes are labelled $\sigma(u)$ and $\sigma(v)$ respectively. $X$ is the input and $R$ is the network’s prediction for $y$. The equations for the Sigmoid activation layer and the network follow.

Sigmoid

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

The Network

$$ R = w_3 * \sigma(w_1 * x + b_1) + w_4 * \sigma(w_2 * x + b_2) + b_3 $$

$\mathcal{L}$ is the loss for a given $x$ & $y$ and the network’s estimate of $y$, i.e. $R$.

Forward Propagation

Here’s the above image, showing the weights, biases and inputs for the forward pass on the network. I am too lazy to try and create an image showing the results. So, textual equations it will be. (I expect that will be true for backward propagation as well.)

$$ u = X * w_1 + b_1 = 1 * 1 + 2 = 3 $$

$$ v = X * w_2 + b_2 = 1 * (-1) + 3 = 2 $$

$$ \sigma(u) = \frac{1}{1 + e^{-u}} = \frac{1}{1 + e^{-3}} = 0.95257412682243 $$

$$ \sigma(v) = \frac{1}{1 + e^{-v}} = \frac{1}{1 + e^{-2}} = 0.88079707797788 $$

$$ R = w_3 * \sigma(u) + w_4 * \sigma(v) + b_3 \\\space\space\space\space = 2 * 0.95257412682243 + 3 * 0.88079707797788 + 4 \\ = 8.5475394875785 $$

$$ \mathcal{L} = \frac{1}{n}\displaystyle\sum_{i=1}^n(y_i - R_i)^2 \\ = (5 - 8.5475394875785)^2 = (-3.5475394875785)^2 \\ = 12.585036415928725 $$

Some rather large numbers. Guess my biases and weights weren’t particularly good selections. But, I wanted to keep the arithmetic fairly simple. Well, except for the sigmoid function.
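For anyone who would rather let Python do the arithmetic, here is a minimal sketch of the forward pass above using only the standard library. The function names (`sigmoid`, `forward`, `mse`) are my own choices for illustration; they are not part of the PyTorch script used later in the post.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, b1, w2, b2, w3, w4, b3):
    """Forward pass of the 1-2-1 network: R = w3*s(u) + w4*s(v) + b3."""
    u = x * w1 + b1
    v = x * w2 + b2
    return w3 * sigmoid(u) + w4 * sigmoid(v) + b3


def mse(y, r):
    """Mean squared error for a single sample (n = 1)."""
    return (y - r) ** 2


# same weights, biases and inputs as in the worked example
r = forward(1.0, w1=1.0, b1=2.0, w2=-1.0, b2=3.0, w3=2.0, w4=3.0, b3=4.0)
loss = mse(5.0, r)
print(f"R = {r:.4f}, loss = {loss:.4f}")
```

The printed values should agree with the hand calculation, to the four decimal places shown.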

Backward Propagation

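Okay, on to the meat of the matter. This time we will be calculating the gradients of the loss with respect to each of the network’s parameters. As I understand it, the whole idea behind training a neural network is to minimize the loss, and the gradients tell us which way to nudge each parameter to do that.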

Before we get working on this, thought I’d like to show you the derivative of the Sigmoid function. No, I didn’t work it out myself.

$$ \frac{\partial}{\partial x}\sigma(x) = \sigma(x) * (1 - \sigma(x)) $$
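Since I did not derive that formula myself, a quick numerical sanity check seems worthwhile. The sketch below compares the closed-form derivative with a central finite difference at the two pre-activation values from the forward pass ($u = 3$ and $v = 2$). The helper names and the step size `h` are assumptions of mine, nothing more.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Closed-form derivative: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-6):
    """Numerical estimate of f'(x) using a symmetric difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (3.0, 2.0):
    analytic = sigmoid_prime(x)
    numeric = central_difference(sigmoid, x)
    print(f"x = {x}: analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```

If the formula is right, the two columns should match to well within the precision printed.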

Note: I will be rounding the forward propagation values (to 4 decimal places) for the following.

Okay, let’s start with the partial derivatives for the output node and the parameters connected to it, i.e. $R$, $b_3$, $w_3$ and $w_4$.

$$ \frac{\partial \mathcal{L}}{\partial R} = -\frac{2}{n}\displaystyle\sum_{i=1}^n(y_i - R_i) = -2 * (5 - 8.5475) = 7.095 $$

$$ \frac{\partial \mathcal{L}}{\partial b_3} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial b_3} = 7.095 * 1 = 7.095 $$

$$ \frac{\partial \mathcal{L}}{\partial w_3} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial w_3} = 7.095 * \sigma(u) = 7.095 * 0.9526 = 6.7587 $$

$$ \frac{\partial \mathcal{L}}{\partial w_4} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial w_4} = 7.095 * \sigma(v) = 7.095 * 0.8808 = 6.2493 $$
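The same three output-layer gradients can be reproduced in a few lines of plain Python. This is only a sketch of the chain rule steps above; the variable names are mine. It uses unrounded forward-pass values, so the last digit may differ slightly from the hand calculation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# inputs and parameters from the worked example
x, y = 1.0, 5.0
w1, b1, w2, b2 = 1.0, 2.0, -1.0, 3.0
w3, w4, b3 = 2.0, 3.0, 4.0

# forward pass
su = sigmoid(x * w1 + b1)   # sigma(u)
sv = sigmoid(x * w2 + b2)   # sigma(v)
r = w3 * su + w4 * sv + b3  # network output R

# backward pass for the output layer (n = 1, so the sum has one term)
dL_dR = -2.0 * (y - r)
dL_db3 = dL_dR * 1.0        # dR/db3 = 1
dL_dw3 = dL_dR * su         # dR/dw3 = sigma(u)
dL_dw4 = dL_dR * sv         # dR/dw4 = sigma(v)

print(f"dL/dR  = {dL_dR:.4f}")
print(f"dL/db3 = {dL_db3:.4f}")
print(f"dL/dw3 = {dL_dw3:.4f}")
print(f"dL/dw4 = {dL_dw4:.4f}")
```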

And continuing backward, though the activation layer does not have any network parameters, it is involved in producing the derivatives for the preceding nodes.

$$ \frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial \sigma(u)} * \frac{\partial \sigma(u)}{\partial u} * \frac{\partial u}{\partial b_1} \\\\ = 7.095 * w_3 * (\sigma(u) * (1 - \sigma(u))) * 1 \\\\ = 7.095 * 2 * (0.9526 * (1 - 0.9526)) = 14.19 * 0.0452 = 0.6414 $$

$$ \frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial \sigma(u)} * \frac{\partial \sigma(u)}{\partial u} * \frac{\partial u}{\partial w_1} \\\\ = 7.095 * w_3 * (\sigma(u) * (1 - \sigma(u))) * x \\\\ = 7.095 * 2 * (0.9526 * (1 - 0.9526)) * 1 = 14.19 * 0.0452 = 0.6414 $$

$$ \frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial \sigma(v)} * \frac{\partial \sigma(v)}{\partial v} * \frac{\partial v}{\partial b_2} \\\\ = 7.095 * w_4 * (\sigma(v) * (1 - \sigma(v))) * 1 \\\\ = 7.095 * 3 * (0.8808 * (1 - 0.8808)) = 21.285 * 0.105 = 2.235 $$

$$ \frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial R} * \frac{\partial R}{\partial \sigma(v)} * \frac{\partial \sigma(v)}{\partial v} * \frac{\partial v}{\partial w_2} \\\\ = 7.095 * w_4 * (\sigma(v) * (1 - \sigma(v))) * x \\\\ = 7.095 * 3 * (0.8808 * (1 - 0.8808)) * 1 = 21.285 * 0.105 = 2.235 $$
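Four-term chain rule products are easy to get wrong, so here is a hedged sketch that computes the hidden-layer gradients with the chain rule and then cross-checks each one against a finite-difference estimate of the loss. The dictionary layout and the helper names (`loss`, `hidden_grads`, `numeric_grad`) are assumptions of mine for illustration. Unrounded values are used throughout, so expect small differences from the rounded hand calculation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss(p, x=1.0, y=5.0):
    """Squared error of the 1-2-1 network for a single (x, y) pair."""
    su = sigmoid(x * p["w1"] + p["b1"])
    sv = sigmoid(x * p["w2"] + p["b2"])
    r = p["w3"] * su + p["w4"] * sv + p["b3"]
    return (y - r) ** 2


def hidden_grads(p, x=1.0, y=5.0):
    """Chain-rule gradients for the hidden-layer weights and biases."""
    su = sigmoid(x * p["w1"] + p["b1"])
    sv = sigmoid(x * p["w2"] + p["b2"])
    r = p["w3"] * su + p["w4"] * sv + p["b3"]
    dL_dR = -2.0 * (y - r)
    dL_du = dL_dR * p["w3"] * su * (1.0 - su)  # through sigma(u)
    dL_dv = dL_dR * p["w4"] * sv * (1.0 - sv)  # through sigma(v)
    return {"b1": dL_du, "w1": dL_du * x, "b2": dL_dv, "w2": dL_dv * x}


def numeric_grad(p, name, h=1e-6):
    """Central finite difference of the loss with respect to one parameter."""
    up, down = dict(p), dict(p)
    up[name] += h
    down[name] -= h
    return (loss(up) - loss(down)) / (2.0 * h)


params = {"w1": 1.0, "b1": 2.0, "w2": -1.0, "b2": 3.0,
          "w3": 2.0, "w4": 3.0, "b3": 4.0}
grads = hidden_grads(params)
for name, g in grads.items():
    print(f"{name}: chain rule = {g:.4f}, "
          f"finite difference = {numeric_grad(params, name):.4f}")
```

Because $x = 1$ here, the weight and bias gradients for each hidden node come out identical, just as in the equations above.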

Use PyTorch to Verify Our Work

Okay, let’s let AutoGrad check our manual calculations above.

Model

We will start by defining our model class, instantiating it and setting the parameters to match the diagram above. Stuff we have mostly done before, except perhaps setting model parameters to specific values. Though the latter is easy enough. We will also create our input tensors with the appropriate $x$ and $y$ values.

# backprop_2.py: use pytorch to generate gradients for simple network used in blog post

import torch
import torch.nn as nn


# define our network
class network(torch.nn.Module):
  def __init__(self, inp_sz, hid_sz, out_sz):
    super().__init__()
    self.hide = nn.Linear(inp_sz, hid_sz)
    self.outp = nn.Linear(hid_sz, out_sz)

  def forward(self, x):
    x = self.hide(x)
    x = torch.sigmoid(x)
    x = self.outp(x)
    return x


# instantiate our simple model, and setup our biases and weights
simple = network(1, 2, 1)
simple.state_dict()['hide.weight'][:] = torch.tensor([[1], [-1]])
simple.state_dict()['hide.bias'][:] = torch.tensor([2, 3])
simple.state_dict()['outp.weight'][:] = torch.tensor([[2, 3]])
simple.state_dict()['outp.bias'][:] = torch.tensor([4])
# set up input tensor, i.e. the x, y input
x, y = torch.tensor([1.0]), torch.tensor([5.0])

# look at current model parameters
print(f"\ncurrent model parameters:\n{simple.state_dict()}")

current model parameters:
OrderedDict({
  'hide.weight': tensor([[ 1.], [-1.]]), 'hide.bias': tensor([2., 3.]),
  'outp.weight': tensor([[2., 3.]]), 'outp.bias': tensor([4.])})

Now before we do anything else let’s have a look at one of the gradients.

# have a look at the current gradients of the output weights
# should not yet have one as pytorch has not yet calculated anything,
# so .grad is None (calling .squeeze() on None would raise an AttributeError)
print(f"output weight gradients: {simple.outp.weight.grad}")

output weight gradients: None

Forward and Backward Propagation

PyTorch is tracking the information it needs to calculate the gradients. But we haven’t yet calculated anything for it to use. So let’s take care of that by executing a forward and backward pass (propagation?). For the latter we will need to have a loss function and a loss value.

For the forward pass, just for comparison with our manually calculated value let’s print out the PyTorch value.

# so let's define our loss function and run our forward and backward passes
loss_fn = nn.MSELoss()

rslt = simple(x)
print(f"\nforward pass: {rslt}")
loss = loss_fn(rslt, y)
print(f"\nloss: {loss}")
loss.backward()

And it looks like PyTorch is tracking the gradient for the predicted $y$ value (tensor).

forward pass: tensor([8.5475], grad_fn=<ViewBackward0>)

loss: 12.585038185119629

That’s a match. Now, let’s have another look at those output weight gradients.

# check that gradient now
# 2D tensor, so reduce to 1D
print(f"\noutput weight gradients: {simple.outp.weight.grad.squeeze()}")

output weight gradients: tensor([6.7586, 6.2493])

And, forgiving differences in rounding, that’s pretty much what we calculated above. So, let’s have a look at all of the gradients as calculated by AutoGrad.

# print all model gradients
print("\nHidden layer gradients")
print(f"biases: {simple.hide.bias.grad}")
print(f"weights: {simple.hide.weight.grad.squeeze()}")
print("\nOutput layer gradients")
print(f"biases: {simple.outp.bias.grad}")
print(f"weights: {simple.outp.weight.grad.squeeze()}")

Hidden layer gradients
biases: tensor([0.6411, 2.2348])
weights: tensor([0.6411, 2.2348])

Output layer gradients
biases: tensor([7.0951])
weights: tensor([6.7586, 6.2493])

Once again, ignoring differences in rounding, exactly what we calculated above.

What next?

Gradient Descent

Let’s have a look at using gradient descent to update our model parameters.

We will need an optimizer to proceed. I was going to use Adam as we have used it in our most recent models (GANs). But, I figured SGD would likely generate values I would find easier to reproduce.

We will instantiate the optimizer, execute an optimization step and have another look at the model parameters. The optimizer also has a state_dict(). But in this case, as you can see, it wasn’t particularly informative.

# need an optimizer, using stochastic gradient descent
mod_opt = torch.optim.SGD

# instantiate optimizer and run optimizer step
opt = mod_opt(simple.parameters(), lr=0.1)

opt.step()
# look at current optimizer state
print("\noptimizer state:")
print(f"{opt.state_dict()}")

# look at model parameters after optimizer step
print(f"\noptimized, 1 step, model parameters:\n{simple.state_dict()}")

optimizer state:
{'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}, 2: {'momentum_buffer': None}, 3: {'momentum_buffer': None}}, 'param_groups': [{'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3]}]}

optimized, 1 step, model parameters:
OrderedDict({
  'hide.weight': tensor([[ 0.9359], [-1.2235]]), 'hide.bias': tensor([1.9359, 2.7765]),
  'outp.weight': tensor([[1.3241, 2.3751]]), 'outp.bias': tensor([3.2905])
})

Let’s see if we can calculate one of those ourselves.

We will adjust the output bias, $b_3$. The new value is calculated as follows.

$$ \text{new output bias} = b_3 - (\frac{\partial \mathcal{L}}{\partial b_3} * \text{lr}) $$

where:
lr = optimizer learning rate

# some test arithmetic
# output bias
print(f"\nnew output bias: 4 - (7.095 * .1) = {(4 - (7.095 * .1)):.4f}")
print(f"or\nchange in output bias: (4 - 3.2905) / .1 = {((4 - 3.2905) / .1):.4f}")

new output bias: 4 - (7.095 * .1) = 3.2905
or
change in output bias: (4 - 3.2905) / .1 = 7.0950

And, that’s a match.
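The same rule applies to every parameter, not just $b_3$. As a rough sketch (plain Python, my own variable names, and assuming vanilla SGD with no momentum or weight decay, which is what the optimizer state above suggests), here is one update step applied to all seven parameters using unrounded gradients.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


x, y, lr = 1.0, 5.0, 0.1
params = {"w1": 1.0, "b1": 2.0, "w2": -1.0, "b2": 3.0,
          "w3": 2.0, "w4": 3.0, "b3": 4.0}

# forward pass
su = sigmoid(x * params["w1"] + params["b1"])
sv = sigmoid(x * params["w2"] + params["b2"])
r = params["w3"] * su + params["w4"] * sv + params["b3"]

# backward pass (same chain rule as the hand calculation)
dL_dR = -2.0 * (y - r)
dL_du = dL_dR * params["w3"] * su * (1.0 - su)
dL_dv = dL_dR * params["w4"] * sv * (1.0 - sv)
grads = {"w1": dL_du * x, "b1": dL_du, "w2": dL_dv * x, "b2": dL_dv,
         "w3": dL_dR * su, "w4": dL_dR * sv, "b3": dL_dR}

# plain SGD step: new value = old value - lr * gradient
updated = {name: value - lr * grads[name] for name, value in params.items()}
for name, value in updated.items():
    print(f"{name}: {params[name]:.4f} -> {value:.4f}")
```

Rounded to four decimal places, these should line up with the state_dict() printed after opt.step().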

Real World

Needless to say, in order to train a real neural network, i.e. a machine learning model, this process of forward pass, backward pass and optimization is repeated a staggeringly large number of times using the available training data. We just did it once to get an idea of how a single iteration works. And the models would have millions to billions of parameters. We had half a dozen or so.
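To get a feel for what repeating the cycle looks like, here is a toy sketch in plain Python that runs forward pass, backward pass and SGD update ten times on our single $(x, y)$ pair. The function name `forward_backward` and the step count are arbitrary choices of mine. Fitting one data point is not really training; it only shows the loss shrinking as the loop repeats.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward_backward(p, x, y):
    """Return the loss and the gradient of the loss for every parameter."""
    su = sigmoid(x * p["w1"] + p["b1"])
    sv = sigmoid(x * p["w2"] + p["b2"])
    r = p["w3"] * su + p["w4"] * sv + p["b3"]
    dL_dR = -2.0 * (y - r)
    dL_du = dL_dR * p["w3"] * su * (1.0 - su)
    dL_dv = dL_dR * p["w4"] * sv * (1.0 - sv)
    grads = {"w1": dL_du * x, "b1": dL_du, "w2": dL_dv * x, "b2": dL_dv,
             "w3": dL_dR * su, "w4": dL_dR * sv, "b3": dL_dR}
    return (y - r) ** 2, grads


params = {"w1": 1.0, "b1": 2.0, "w2": -1.0, "b2": 3.0,
          "w3": 2.0, "w4": 3.0, "b3": 4.0}
lr = 0.1
losses = []

for step in range(10):
    loss, grads = forward_backward(params, x=1.0, y=5.0)
    losses.append(loss)
    # SGD update for every parameter
    for name in params:
        params[name] -= lr * grads[name]
    print(f"step {step}: loss = {loss:.6f}")
```

A real training run would loop over many different samples (usually in batches) for many epochs, rather than hammering on a single point.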

For example, I have, while preparing this post for publication, been trying to train a CycleGAN. I am using around 1350 images of horses and zebras for each epoch of training. Each discriminator has ~2¾ million parameters. The generators have approximately 11⅓ million parameters each. The network as a whole is over 28 million parameters.

I am using a batch size of 2, which gives 667 iterations of the above process per epoch. It has so far been trained for 80 epochs, which has taken over 16 hours. And, it has still not been properly trained. Expect my model design/implementation is not quite correct.

Done

I am sure there is more we could do with this example. For instance, I did think about using another set of inputs and running the whole thing again, looking at the gradients and the changes to the model parameters.

But, for now I am calling it quits. Though I suppose that is always subject to change.

Until next time, I hope your learning experiments go as well as this one of mine.

Resources

torch.optim.SGD
Understanding the Derivative of the Sigmoid Function
Activation and Loss Functions for neural networks — How to pick the right one?
Loss Functions in Machine Learning Explained

Too Old To Code

MCL with Pytorch: Side Trip — Gradients & Back-propagation, Part II
