I temporarily considered stopping work on the project while I took a side trip into neural network concepts. That happened when I couldn’t understand how .backward() could be called on the loss tensor (Wasserstein distance plus gradient penalty) we were generating for the critic’s training. It led to the best part of two days spent searching and reading, a worthwhile endeavour I felt I should document and share. I did tentatively reschedule the current drafts of future machine learning posts and start one for the side trip. But I have reconsidered and decided I should finish the project before venturing into uncharted territory.

So, let’s see if we can get that training loop coded.

View Fake Images

In the last project, I included code in the training loop itself to create a tensor of labels and use it to generate and display, or save, fake images each epoch. This time I am going to put that code in a function which I will call at some interval within the training loop.

This new function calls one I had written previously and copied over to this module, though I did need to refactor it slightly.

def plot_tst_gen(epoch, i_show=False):
  # 16 random latent vectors plus a two-channel one-hot label block
  noise = torch.randn(16, nz, 1, 1)
  labels = torch.zeros(16, 2, 1, 1)
  # alternate the one-hot label channels: odd indices set channel 0, even indices channel 1
  for i in range(1, 16, 2):
    labels[i, 0, :, :] = 1
  for i in range(0, 16, 2):
    labels[i, 1, :, :] = 1
  # concatenate noise and labels along the channel dimension and generate the fakes
  nz_lbls = torch.cat([noise, labels], dim=1).to(device)
  fakes = genr(nz_lbls)
  # take the RGB channels and undo the [-1, 1] normalization for display
  img = fakes[:, :3, :, :] / 2 + 0.5
  # per-image class indicator taken from the second label channel
  i_lbls = labels[:, 1].flatten().detach()
  # call the older function to generate and show or save the actual image
  img_grid_lbl(img, i_lbls, 4, i_show=i_show, epoch=epoch)
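
Within the training loop I expect to call it at some interval, something like the snippet below; the every-five-epochs interval is just a placeholder, not a settled choice.

  # call at some interval within the training loop; every 5 epochs is only a placeholder
  if epoch % 5 == 0:
    plot_tst_gen(epoch, i_show=False)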

Train on Batch of Data

Rather than include the code for training on each batch of data in the training loop itself, I am also going to put it in a function to be called from within the loop. It will do everything necessary to train both networks, including calculating the loss (Wasserstein distance) and gradient penalty as appropriate, and then using the calculated losses to update each network via its optimizer. (Not sure that is the correct terminology, but…)

As done in most of the tutorials/examples I looked at, in each loop the critic will be trained more often than the generator. I am using a variable to control the ratio of critic training cycles to generator training cycles, as well as a variable for the multiplier to apply to the gradient penalty when determining the critic’s loss.

  c_g_ratio = 5           # ratio of critic training iterations to generator training iterations
  c_lambda = 10           # multiplier for gradient penalty when calculating critic loss
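
For reference, the standard WGAN-GP losses these values feed into look roughly like this, where C is the critic, G the generator, x a real sample, z the noise input, x̂ a random interpolation between a real and a generated sample, and λ the c_lambda multiplier above:

$$L_{critic} = \mathbb{E}\big[C(G(z))\big] - \mathbb{E}\big[C(x)\big] + \lambda\,\mathbb{E}\big[(\lVert \nabla_{\hat{x}} C(\hat{x}) \rVert_2 - 1)^2\big]$$

$$L_{gen} = -\mathbb{E}\big[C(G(z))\big]$$

The first two terms are the negated Wasserstein distance estimate the critic is trying to maximize; the last term is the gradient penalty.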

The primary reason for the extra critic training is that we are penalizing the critic but not the generator, so there is a greater risk of the generator overpowering the critic. Were that to happen, the model would not converge to a satisfactory state. To balance things out, we train the critic more frequently than the generator for each batch in each epoch. If the ratio is too large, we risk the critic overpowering the generator instead, at which point the generator will stop learning no matter how many more epochs of training it gets. Some of the examples I viewed stopped the extra training after some number of batches/epochs.

However, all of this comes at a price. Even on a GPU, training will be rather slow: the gradient penalty requires computing the gradient of a gradient, and in our case that happens 5 times per batch. So we are talking potentially a few minutes per epoch!
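
As a reminder of why that is expensive, a typical WGAN-GP gradient penalty looks something like the following sketch. It is illustrative only, not my actual g_penalty function, which also contains the debugging print statements that show up in the terminal output further down.

def gp_sketch(critic, real, fake):
  # one random interpolation weight per sample, broadcast over the remaining dimensions
  b_sz = real.shape[0]
  eps = torch.rand(b_sz, 1, 1, 1, device=real.device)
  # mix the real and fake (image plus label channels) batches
  mx_imgs = (eps * real + (1 - eps) * fake).requires_grad_(True)
  mx_scores = critic(mx_imgs)
  # gradient of the critic's scores with respect to the mixed inputs;
  # create_graph=True is what makes the later backward() a gradient of a gradient
  grads = torch.autograd.grad(
    outputs=mx_scores, inputs=mx_imgs,
    grad_outputs=torch.ones_like(mx_scores),
    create_graph=True, retain_graph=True)[0]
  grads = grads.reshape(b_sz, -1)
  # penalize any deviation of the gradient norm from 1
  return torch.mean((grads.norm(2, dim=1) - 1) ** 2)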

  def train_batch(one_hots, imgs_w_lbls, epoch):
    global do_dbg
    riwl = imgs_w_lbls.to(device)
    # this may not always be the default batch size 16
    b_sz = riwl.shape[0]
    # will train critic in 'c_g_ratio' to 1 ratio with generator
    # don't want generator to overpower critic, so want to
    # get critic to optimum performance as quickly as possible vs generator
    # without risking critic overpowering the generator
    for _ in range(c_g_ratio):
      noise = torch.randn(b_sz, nz, 1, 1)
      o_hs = one_hots.reshape(b_sz, 2, 1, 1)
      nz_lbls = torch.cat([noise, o_hs], dim=1).to(device)
      fake_imgs = genr(nz_lbls).to(device)
      fake_lbls = imgs_w_lbls[:, 3:, :, :].to(device)
      fiwl = torch.cat([fake_imgs, fake_lbls], dim=1).to(device)
      # make sure we make the following values available to backward pass
      crit_real = critic(riwl).reshape(-1).requires_grad_()
      crit_fake = critic(fiwl).reshape(-1).requires_grad_()
      gp = g_penalty(critic, riwl, fiwl)
      c_loss = (-(torch.mean(crit_real) - torch.mean(crit_fake)) + c_lambda * gp)
      critic.zero_grad()
      # keep loss tensor in computational graph so can recalc loss next iteration
      c_loss.backward(retain_graph=True)
      opt_c.step()
    # no gradient penalty on generator
    g_loss = -torch.mean(crit_fake)
    genr.zero_grad()
    g_loss.backward()
    opt_g.step()
    return c_loss, g_loss

Test Above Function

Okay, before trying to run a full training session of numerous epochs, I decided I should test the above for a single epoch. Rather than edit any of the global variables, I decided to just write an if block with the necessary code. And while I was at it, I decided to save the network models as well.

... ...
# Save models, using both torchscript and model state
def sv_model(epoch, n_model, nm_nm, optimizer, loss, batch_sz, trn_len):
  m_script = torch.jit.script(n_model)
  fl_nm = Path(f"{nm_nm}_wgan-wp_g_ng_script_{batch_sz}_{trn_len}.pt")
  m_script.save(sv_dir / fl_nm)
  st_fl = Path(f"{nm_nm}_wgan-wp_g_ng_chkpt_{batch_sz}_{trn_len}.pt")
  torch.save({
            'epoch': epoch,
            'model_state_dict': n_model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            }, sv_dir / st_fl)
... ...
  if tst_t_btch:
    st_tm = time.perf_counter()
    cl_tot = 0
    gl_tot = 0
    epoch = 1
    trn_len = 1
    for _, i_lbls, one_hots, imgs_n_lbls in img_ldr:
      l_crit, l_gen = train_batch(one_hots, imgs_n_lbls, epoch)
      cl_tot += l_crit
      gl_tot += l_gen
    print(f"epoch {epoch}: critic loss: {cl_tot}, generator loss: {gl_tot}")
    plot_tst_gen(epoch, i_show=False)
    nd_tm = time.perf_counter()
    print(f"\ntime to run one epoch (while testing code): {nd_tm - st_tm}")  

if do_save:
  sv_model(epoch, genr, "gen", opt_g, gl_tot, batch_sz, trn_len)
  sv_model(epoch, critic, "critic", opt_c, cl_tot, batch_sz, trn_len)
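
For completeness, a checkpoint saved this way can later be restored along these lines (a sketch only; the file name is illustrative and would have to match whatever sv_model actually wrote):

# sketch: reload a checkpoint written by sv_model (illustrative file name)
chkpt = torch.load(sv_dir / "gen_wgan-wp_g_ng_chkpt_32_1.pt")
genr.load_state_dict(chkpt['model_state_dict'])
opt_g.load_state_dict(chkpt['optimizer_state_dict'])
st_epoch = chkpt['epoch']

Anyway, time to run the test.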

And right away, a bug. The lines before the Traceback are print statements I had added for another reason and had not yet removed. But they would actually help lead me to the cause of the bug.

(mclp-3.12) PS F:\learn\mcl_pytorch\chap5> python wgan-gp_g_ng.py
models loaded to gpu, dataloader instantiated with modified dataset, no training code yet

one_hots: torch.Size([16, 2]), imgs_n_lbls: torch.Size([16, 5, 256, 256])

real: torch.Size([16, 5, 256, 256]), fake: torch.Size([16, 5, 256, 256])
        e_shape: [16, 1, 1, 1], mx_imgs: torch.Size([16, 5, 256, 256])

real: torch.Size([16, 5, 256, 256]), fake: torch.Size([16, 5, 256, 256])
        e_shape: [16, 1, 1, 1], mx_imgs: torch.Size([16, 5, 256, 256])

real: torch.Size([16, 5, 256, 256]), fake: torch.Size([16, 5, 256, 256])
        e_shape: [16, 1, 1, 1], mx_imgs: torch.Size([16, 5, 256, 256])

real: torch.Size([16, 5, 256, 256]), fake: torch.Size([16, 5, 256, 256])
        e_shape: [16, 1, 1, 1], mx_imgs: torch.Size([16, 5, 256, 256])

real: torch.Size([16, 5, 256, 256]), fake: torch.Size([16, 5, 256, 256])
        e_shape: [16, 1, 1, 1], mx_imgs: torch.Size([16, 5, 256, 256])

Traceback (most recent call last):
  File "F:\learn\mcl_pytorch\chap5\wgan-gp_g_ng.py", line 488, in <module>
    l_crit, l_gen = train_batch(one_hots, imgs_n_lbls, epoch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\learn\mcl_pytorch\chap5\wgan-gp_g_ng.py", line 476, in train_batch
    g_loss.backward()
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 512, 4, 4]] is at version 6; expected version 5 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Those five blocks of output beginning with real: were printed by the gradient penalty function while testing the batch training function, which implies the critic completed its 5 training iterations at the top of the function. That made me look at the traceback more carefully. Something was going wrong when training the generator, and it was happening during backpropagation.
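
As an aside, the hint at the end of the traceback refers to PyTorch’s autograd anomaly detection. Switching it on while debugging adds a second traceback pointing at the forward-pass operation that failed to compute its gradient, at the cost of slowing everything down:

# enable only while debugging; it makes the forward pass noticeably slower
torch.autograd.set_detect_anomaly(True)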

It turns out I was trying to be too clever. In the critic training loop I was already generating fake images and scoring them with the critic, so I figured I could just reuse the last of those scores to work out the loss for the generator.

    # no gradient penalty on generator
    g_loss = -torch.mean(crit_fake)

Sadly, that caused the error above. By the time g_loss.backward() ran, the optimizer step had already modified the critic’s weights in place, so the crit_fake scores saved during critic training no longer matched the graph they came from. I really had to get a fresh set of critic scores for the fake images after the critic’s final update.

    crit_gen = critic(fiwl).reshape(-1)
    # no gradient penalty on generator
    g_loss = -torch.mean(crit_gen)

Okay, bug fixed. Let’s run the test again. Well…

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 11.00 GiB of which 0 bytes is free. Of the allocated memory 10.22 GiB is allocated by PyTorch, and 15.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

My GPU has a goodly amount of memory, certainly not less than the older GPUs that might be running this code. So something in the code must be the issue.

I eventually found something on the web suggesting that the loss tensors being returned from the train_batch function still had their computational graphs attached. Accumulating them batch after batch kept all of those graphs alive on the GPU until its memory ran out. The author of that post suggested detaching the loss values before returning them.

    return c_loss.detach(), g_loss.detach()

And that seemed to solve the memory issue.
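
An alternative that should work just as well is to accumulate plain Python numbers in the calling loop rather than tensors, which likewise drops the attached graph:

      # alternative to detaching inside train_batch: accumulate Python floats instead
      cl_tot += l_crit.item()
      gl_tot += l_gen.item()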

Successful Tests

I decided to see what my GPU memory could handle in terms of batch size. Here’s some terminal output for various batch sizes.

batch size: 16
(mclp-3.12) PS F:\learn\mcl_pytorch\chap5> python wgan-gp_g_ng.py
models loaded to gpu, dataloader instantiated with modified dataset, testing training code
epoch 1: critic loss: -26876.869140625, generator loss: 27261.87109375

time to run one epoch (while testing code): 230.44278959999792

batch size: 32
(mclp-3.12) PS F:\learn\mcl_pytorch\chap5> python wgan-gp_g_ng.py
models loaded to gpu, dataloader instantiated with modified dataset, testing training code
epoch 1: critic loss: -10104.2744140625, generator loss: 10737.6328125

time to run one epoch (while testing code): 152.2530015999946

batch size: 40
(mclp-3.12) PS F:\learn\mcl_pytorch\chap5> python wgan-gp_g_ng.py
models loaded to gpu, dataloader instantiated with modified dataset, testing training code
epoch 1: critic loss: -7561.72265625, generator loss: 8080.75830078125

time to run one epoch (while testing code): 139.12233649998961

batch size: 64
(mclp-3.12) PS F:\learn\mcl_pytorch\chap5> python wgan-gp_g_ng.py
models loaded to gpu, dataloader instantiated with modified dataset, testing training code
epoch 1: critic loss: -2863.9296875, generator loss: 3895.00537109375

time to run one epoch (while testing code): 114.31068960000994

Don’t get fooled by those loss numbers: they are the totals over all the batches in the epoch. With a batch size of 64 there are four times fewer batch losses being summed than with a batch size of 16.
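
If I wanted numbers that are comparable across batch sizes, an average per batch would do the trick, something like:

    # divide the epoch totals by the number of batches in the dataloader
    n_btch = len(img_ldr)
    print(f"epoch {epoch}: avg critic loss: {cl_tot / n_btch}, avg generator loss: {gl_tot / n_btch}")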

Sample Generator Images for Two Batch Sizes

Sample of generator images after 1 epoch of training with batch size of 16

Sample of generator images after 1 epoch of training with batch size of 64

I expect the difference is because with a batch size of 16 the generator was trained 278 times during the epoch, whereas with a batch size of 64 it was only trained 70 times (the generator gets one update per batch, so fewer, larger batches mean fewer generator updates per epoch). There is a trade-off between the time for each epoch of training and the quality of the results. So I think I will compromise and use a batch size of 32: for now, a reasonable gain in speed for an acceptable loss in performance.

Done

This post is already much longer than I anticipated. The actual all-out training will likely have to wait for the next post.

And there were a few other bugs, primarily related to getting the images plotted. I decided to leave those out of the post; I will just say they were pretty much all related to my lack of understanding of the differences between tensors, Python lists and NumPy arrays.

Until next time, I hope your understanding of the subtleties is progressing better than mine.

Resources