For the next project, a GAN to generate colour anime images, I will apparently need to use a convolutional neural network (CNN). That is because, when processing images, they are considerably more efficient, with respect to both time and computing resources, than the fully connected transformation layers I have been using until now. At least that’s what the web is telling me. And from what I’ve read it would appear to be true. Though how they manage that is somewhat baffling.

Before tackling the colour images, I wanted to apply a CNN to generating the clothing images we worked on in the last post, hoping it would be faster and generate better quality images (another thing CNNs are supposed to do). But I also wanted to look at what the various layers in the CNN are doing. A little bit of understanding is better than no understanding. And so far, a very wee bit is all I have. There are many things I really don’t follow with respect to the whole network. But the individual pieces I do seem to grasp.

So, I thought I’d start by writing about those individual pieces (layers?).

Basic CNN Components

The concept of convolutional neural networks is based on how human vision works. That is, individual neurons detect features that occur in a small region of our total receptive field. Neurons close to the retina detect fairly simple features, e.g. edges. Each set of neurons further down the line detects more complicated features.

The network itself will have an input layer, a number of hidden layers (built around convolutional layers) and an output layer. The output layer is generally a fully connected layer. Each hidden layer will contain a convolutional layer and an activation layer. Hidden layers can also include pooling layers and/or normalization layers. But I don’t expect I will need those for the test implementation.
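
For what it’s worth, here is a minimal sketch of that layout in PyTorch. The channel counts and the image size are made up purely for illustration, and have nothing to do with the upcoming GAN.

import torch
from torch import nn

# two hidden blocks (convolution + activation) followed by a fully connected output layer
tiny_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # 1 input channel -> 8 feature maps
    nn.ReLU(),                        # activation layer
    nn.Conv2d(8, 16, kernel_size=3),  # 8 -> 16 feature maps
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 24 * 24, 10),      # fully connected output layer
)

x = torch.randn(1, 1, 28, 28)         # one fake 28x28 grayscale image
print(tiny_cnn(x).shape)              # torch.Size([1, 10])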

I won’t look at the activation or normalization layers in any detail. Suffice it to say, the activation layer helps the network learn non-linear relationships between the features in the image. It also helps mitigate the vanishing gradient problem. (Again, so the web says.)

Normalization layers are used to reduce the chance of model overfitting (regularization) and to speed up the training process. Overfitting produces models that perform extremely well on the training data, but poorly on new data. Pooling layers also reduce the chance of overfitting.
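
And, should I end up needing them, this is roughly where normalization and pooling layers would slot into a hidden block. A sketch using PyTorch’s BatchNorm2d and MaxPool2d (there are other flavours of both):

from torch import nn

block = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),  # convolutional layer
    nn.BatchNorm2d(8),               # normalization layer
    nn.ReLU(),                       # activation layer
    nn.MaxPool2d(2),                 # pooling layer: halves the height and width
)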

Check the resource section at the end of the post for a number of tutorials/explanations.

Convolutional Layer

The convolutional layer is the fundamental building block of a CNN. Its name comes from the mathematical concept of convolution: essentially, applying a function by sliding a smaller window (matrix) over an input matrix. In a CNN that sliding function is referred to as a filter or, more correctly, a kernel. At each convolutional layer any number of kernels can be applied to the input matrices.

“Convolution in the time domain is multiplication in the frequency (Fourier) domain.”
Intuitive Guide to Convolution by Kalid Azad

Nice quote. But I’d hardly call it intuitive. Guess it would help to have a higher degree in mathematics or perhaps engineering.

Depending on the size of that sliding window, padding and the size of its stride as it slides across the input matrix, the output matrix can be the same size or smaller. I am not good enough at imaging software to try and create an animation of the process. So, I am going to use a series of HTML tables to provide an idea of how the convolution works. (A little more post real estate, but much easier/simpler for my skill level.)

To make things even easier the input matrix and the kernel are going to be relatively small. But I do eventually plan to show the result of applying one or more kernels on a real image (I hope!?).

The kernel starts at the top left and calculates a value for the related position in the output matrix. That value is the sum of each kernel element times the input matrix value currently below it. The kernel then takes a step (of some stride size) to the right and calculates the next output value. It continues to the right until it can no longer move right without a column of the kernel falling outside the input matrix. Though padding can affect that last bit of movement.

At that point it moves all the way back to the left and steps downwards, using the same stride size. Repeat until the matrix has been traversed.
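
If code is easier to follow than a wall of tables, here’s a naive (and slow) sketch of that sliding window in plain Python/NumPy, with no padding. The function name and details are mine, purely for illustration.

import numpy as np

def naive_conv2d(inp, kernel, stride=1):
    """Slide the kernel over the input (both square), summing the elementwise products.
       No padding, so the output is smaller than the input."""
    k = kernel.shape[0]
    n_out = (inp.shape[0] - k) // stride + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):          # step down the rows
        for j in range(n_out):      # step across the columns
            r, c = i * stride, j * stride
            window = inp[r:r+k, c:c+k]           # input values currently under the kernel
            out[i, j] = (window * kernel).sum()  # multiply elementwise and sum
    return out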

Stride and Padding

Stride is the number of cells/elements/pixels the filter moves each time it is applied to the input data. The example below uses a stride of one, but larger stride sizes are permitted and used. In our case the output matrix is smaller than the input matrix. We can make that output matrix somewhat bigger by using padding. Padding is the addition of a border (or borders) of numbers around the input matrix, thereby increasing its size. The amount of padding is typically based on the size of the input and the size of the desired output. And it only makes sense to use a padding size that is smaller than the dimension of the filter matrix.

The stride, padding and kernel size all affect the size of the output matrix. Note, as far as I can tell, inputs and kernels are expected to be square.

$$n_{out} = \left\lfloor\frac{n_{in} + 2p - k}{s}\right\rfloor + 1$$

where: \(n_{out}\) is the output matrix dimension size,
\(n_{in}\) is the input matrix dimension size,
\(p\) is the convolution padding size,
\(k\) is the kernel dimension size, and
\(s\) is the stride size.

In the example below—input 5x5, kernel 3x3, stride 1, no padding:

$$n_{out} = \left\lfloor\frac{5 + 2 \cdot 0 - 3}{1}\right\rfloor + 1 = 3$$
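
That arithmetic is trivial to code up (the floor division only matters once the stride is bigger than one):

def conv_output_size(n_in, k, p=0, s=1):
    """floor((n_in + 2p - k) / s) + 1, for square inputs and kernels"""
    return (n_in + 2 * p - k) // s + 1

print(conv_output_size(5, 3))     # 3 -- the example below
print(conv_output_size(1024, 5))  # 1020 -- the cat image test later in the post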

Example Convolution

The 5x5 input matrix:

 50  77 111  94 109
 28  80 100  40  76
 85  79  63  74 114
 67  55  99 105 123
119  96  73  58  30

And the 3x3 kernel:

 0 -1  0
-1  5 -1
 0 -1  0

The kernel at its starting position, over rows 1–3, columns 1–3:

   50 * 0   +  77 * -1  + 111 * 0
+  28 * -1  +  80 * 5   + 100 * -1
+  85 * 0   +  79 * -1  +  63 * 0
= 116

Output so far:

116    .    .
  .    .    .
  .    .    .

The kernel moves one column to the right (rows 1–3, columns 2–4):

   77 * 0   + 111 * -1  +  94 * 0
+  80 * -1  + 100 * 5   +  40 * -1
+  79 * 0   +  63 * -1  +  74 * 0
= 206

Output so far:

116  206    .
  .    .    .
  .    .    .

The kernel moves one more column to the right (rows 1–3, columns 3–5):

  111 * 0   +  94 * -1  + 109 * 0
+ 100 * -1  +  40 * 5   +  76 * -1
+  63 * 0   +  74 * -1  + 114 * 0
= -144

Output so far:

116  206 -144
  .    .    .
  .    .    .

The kernel can no longer move right, so it goes back to the left and down one row (rows 2–4, columns 1–3):

   28 * 0   +  80 * -1  + 100 * 0
+  85 * -1  +  79 * 5   +  63 * -1
+  67 * 0   +  55 * -1  +  99 * 0
= 112

Output so far:

116  206 -144
112    .    .
  .    .    .

... ...

And finally, the kernel at its last position (rows 3–5, columns 3–5):

   63 * 0   +  74 * -1  + 114 * 0
+  99 * -1  + 105 * 5   + 123 * -1
+  73 * 0   +  58 * -1  +  30 * 0
= 171

The completed output matrix:

116  206 -144
112  -37   48
-66  199  171
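
To convince myself the arithmetic above is correct, PyTorch’s functional conv2d (which, like most deep learning libraries, slides the kernel without flipping it, exactly as described above) reproduces the same output matrix:

import torch
import torch.nn.functional as F

inp = torch.tensor([[ 50.,  77., 111.,  94., 109.],
                    [ 28.,  80., 100.,  40.,  76.],
                    [ 85.,  79.,  63.,  74., 114.],
                    [ 67.,  55.,  99., 105., 123.],
                    [119.,  96.,  73.,  58.,  30.]])
kernel = torch.tensor([[ 0., -1.,  0.],
                       [-1.,  5., -1.],
                       [ 0., -1.,  0.]])

# conv2d expects (batch, channels, height, width) shaped tensors
out = F.conv2d(inp.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3), stride=1)
print(out.squeeze())
# tensor([[ 116.,  206., -144.],
#         [ 112.,  -37.,   48.],
#         [ -66.,  199.,  171.]])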

Test

Okay, let’s see if we can apply a small convolutional layer (3 kernels) to a grayscale image and have a look at the resulting activations (3 images) and the kernels after a single pass of the image through the layer.

Why, you ask? If for no other reason, it gives me a chance to try some new code. It may or may not be of use when I get to messing with the CNN GAN for the clothing images. Also, I thought it might be a little interesting. Oh yes, and an image is worth a thousand words!

I am, of course, using an image of a cat! It is 1024 x 1024 pixels and is in fact a colour image, so it has three channels. I will only use one of them, as I am pretending this is a grayscale image.

I loaded the image separately as I wanted to check its shape and extract a single channel to present to the convolutional layer. I was going to use plt.imread() but:

This function exists for historical reasons. It is recommended to use PIL.Image.open instead for loading images.

matplotlib.pyplot.imread

Fortunately pillow is already installed in my conda environment.

I begin by converting the image to a suitable tensor. I had initially planned to use just the red channel (Pillow returns a single channel as a grayscale, ‘L’, image), but ended up converting the whole image to grayscale instead. I am basically using the same transform process used when loading the images in previous projects. I did see people using numpy and converting to a tensor, but figured why try something new when I already have something that works.

... ...
from PIL import Image
... ...
# trf here is torchvision.transforms (imported in the code elided above)
transform = trf.Compose([trf.ToTensor()])
c_in = Image.open("data/black_white_cat_1024.png")
# at first used the following, then switched to converting to grayscale
# cat_img = transform(c_in.getchannel("R")).unsqueeze(0)
cat_img = transform(c_in.convert("L")).unsqueeze(0)  # add a batch dimension: [1, 1, H, W]
print(f"Image size: {c_in.getbands()} @ {c_in.size}, tensor: {cat_img.shape}")
c_in.close()

Next I instantiate a convolutional layer. The layer and the image will be passed to a function defined below. I also print, to the terminal, the current state of the three kernel weights (supposedly randomly generated). Just ‘cuz.

# defaults for the other parameters (stride=1, padding=0, etc.)
# 1 input channel, 3 output channels (i.e. 3 kernels), each kernel 5x5
conv_layer = nn.Conv2d(1, 3, kernel_size=5)

# the kernel weights as randomly initialized by PyTorch
weight_data = conv_layer.weight.data
print(f"\n{weight_data}")

Finally I display the original image and the 3 output activations (images).

... ...
def plot_convs(image, conv_layer, axis=False):
    """Plot the original image and the activations produced by conv_layer."""
    filtered_img = conv_layer(image)
    n_out = filtered_img.shape[1]    # number of output channels, i.e. kernels

    f_wd = (n_out + 1) * 4
    p_img = image.permute(2, 3, 1, 0).detach().squeeze()
    print(f"\nconv size: {n_out}, fig size: {f_wd}, plot image: {p_img.shape}")
    _, axs = plt.subplots(figsize=(f_wd, 4), ncols=n_out+1)
    # the original image goes in the first subplot
    axs[0].imshow(p_img, cmap="gray")
    axs[0].set_title("Original")
    axs[0].grid(False)
    if not axis:
        axs[0].axis(False)
    # move the channel dimension so each activation can be sliced out below
    fc1 = filtered_img.permute(2, 3, 1, 0)
    for n in range(n_out):
        p_img = fc1[:, :, n].detach().squeeze()
        print(f"\tconv {n}: image {fc1.shape}, p_img {p_img.shape}")
        axs[n+1].imshow(p_img, cmap="gray")
        axs[n+1].set_title(f"Kernel {n+1}")
        axs[n+1].grid(False)
        if not axis:
            axs[n+1].axis(False)

    plt.tight_layout()
... ...
plot_convs(cat_img, conv_layer)
plt.show()

The function plot_convs is actually coded higher up in the module, but this seemed like the right time to show it.

Figure: output of the 3 kernel convolutional layer on the cat image, with the original image also displayed

And the terminal output is shown below.

(mclp-3.12) PS F:\learn\mcl_pytorch\chap4> python cnn_learn.py
Image size: ('R', 'G', 'B') @ (1024, 1024), tensor: torch.Size([1, 1, 1024, 1024])

tensor([[[[ 0.0114, -0.1353,  0.1548,  0.0486, -0.1816],
          [-0.0458, -0.1097,  0.1135, -0.1179, -0.1253],
          [ 0.1609,  0.1969, -0.0164,  0.0964, -0.0175],
          [-0.0469,  0.1172, -0.1158,  0.0083, -0.1968],
          [-0.0219, -0.1842,  0.1123,  0.1066,  0.1229]]],


        [[[ 0.1553,  0.1457, -0.0498,  0.1033,  0.1988],
          [ 0.0792, -0.1623, -0.1741,  0.0919, -0.0628],
          [-0.0361, -0.0867,  0.1713,  0.1492,  0.0553],
          [-0.0023, -0.0881, -0.1530, -0.0741,  0.0861],
          [ 0.0990, -0.0637, -0.1023, -0.1887, -0.1474]]],


        [[[-0.0403, -0.1722, -0.0024,  0.0550,  0.1128],
          [-0.0111, -0.1682,  0.0754, -0.0102,  0.0229],
          [ 0.1563,  0.0738,  0.0950,  0.0138, -0.1238],
          [-0.0626, -0.0299, -0.1354, -0.1932, -0.0421],
          [-0.0758, -0.0030, -0.1643,  0.1792, -0.0020]]]])

conv size: 3, fig size: 16, plot image: torch.Size([1024, 1024])
        conv 0: image torch.Size([1020, 1020, 3, 1]), p_img torch.Size([1020, 1020])
        conv 1: image torch.Size([1020, 1020, 3, 1]), p_img torch.Size([1020, 1020])
        conv 2: image torch.Size([1020, 1020, 3, 1]), p_img torch.Size([1020, 1020])
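
As a small sanity check, the 1020 x 1020 activation size matches the earlier formula: \((1024 + 2 \cdot 0 - 5) / 1 + 1 = 1020\).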

Done?

I think that’s it for this one. Lots of resources with much better explanations than I have provided are listed below. But, I wanted to try to document this to make sure I have some idea of what will, at some simple level, be going on in the CNN GAN.

Thought I’d also have a look at pooling and transposed convolution layers in this post, but they will have to wait for the next one.

Until next time, perhaps you will get a little reading and writing of your own done.

Resources