Okay, as I suggested in my last post, I am going to try a series of training sessions (initial and resumed) with additional training for the generators to see how that affects the model’s learning progress.
Extra Training for Generators
I will go for an initial session of 15 epochs, with 10 epochs of additional training for the generators. But this time I will only train the generators twice in each iteration, and I will generate images less often. The losses will still be saved every 50 iterations.
BUG!
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek7 -ep 15 -bs 2 -sc 50 -si 400 -xg 2 -xe 10
image and checkpoint directories created: runs\rek7_img & runs\rek7_sv
training epochs: range(0, 15)
starting training loop
epoch: 0%| | 0/667 [00:00<?, ?it/s]E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py:266: UserWarning: Error detected in TanhBackward0. Traceback of forward call that caused the error:
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 195, in <module>
fake_b = genr_b(img_a)
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "F:\learn\mcl_pytorch\proj6\models.py", line 119, in forward
return torch.tanh(x)
(Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:118.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
epoch: 0%| | 0/667 [00:04<?, ?it/s]
Traceback (most recent call last):
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 256, in <module>
genr_loss.backward()
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
torch.autograd.backward(
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Well that didn’t work. No idea why. Okay, it appears to be tied to the tanh used to generate the output for the generators. I jumped in and started messing around. Didn’t take a lot of notes. But I eventually modified the generator model. I moved the torch.tanh() call into the self.decoder attribute. But I had to switch it to the nn.Tanh() module for that to work. And I removed the torch.tanh() call from the model’s forward method.
That portion of the Generator class now looks like the following.
... ...
        # decoding/upsampling, essentially mirror of the encoder, last block does not upsample
        self.decoder = nn.Sequential(
            GConvBlock(init_feats*4, init_feats*2, k_sz=3, s_sz=2, ip_sz=1, upsample=True),
            GConvBlock(init_feats*2, init_feats, k_sz=3, s_sz=2, ip_sz=1, upsample=True),
            GConvBlock(init_feats, 3, k_sz=7, s_sz=1, ip_sz=3, activation=False, normalize=False),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.residuals(x)
        x = self.decoder(x)
        return x
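Before going further, it is worth noting what that first RuntimeError is actually about. The snippet below is a minimal, self-contained illustration rather than the project code: tanh’s backward pass needs the output tensor it saved during the forward pass, and autograd frees those saved tensors after the first call to backward() unless retain_graph=True is specified.

import torch

# minimal illustration of the first error above, not the project code:
# tanh's backward needs its saved forward output, and autograd frees the
# graph's saved tensors after the first backward() unless retain_graph=True
w = torch.randn(4, requires_grad=True)
out = torch.tanh(w)
loss = out.sum()

loss.backward()   # works; the saved tensors (including tanh's output) are freed here
loss.backward()   # RuntimeError: Trying to backward through the graph a second time ...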
I then added retain_graph=True to all the calls to the backward() method in the generator training loop. In fact, I put the call to backward() in an if/else block: the if branch called genr_loss.backward(retain_graph=True) and the else branch called genr_loss.backward(). I didn’t want to maintain the computational graph after the last iteration. Not really sure I want to maintain it after any iteration of extra training, but for now that’s what I am doing, based on the last few lines of the error message above.
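The relevant bit of the extra generator training loop now looks roughly like the following. This is a sketch rather than a verbatim copy of the script: xtra is a hypothetical name for the extra-pass loop index, while x_genr is the configured number of extra generator passes.

# sketch: 'xtra' (hypothetical name) indexes the extra generator passes,
# x_genr is the configured number of extra passes per iteration
if xtra < x_genr - 1:
    # keep the computational graph so the next extra pass can call backward() again
    genr_loss.backward(retain_graph=True)
else:
    # last extra pass: let autograd free the graph's saved tensors as usual
    genr_loss.backward()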
That did not fix the error. So I added a bunch of print statements to try to get a handle on what was going on. Their output is hopefully shown in green below.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek7 -ep 15 -bs 2 -sc 50 -si 400 -xg 2 -xe 10
{'run_nm': 'rek7', 'dataset_nm': 'no_nm', 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 15, 'batch_sz': 2, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 2, 'x_eps': 10}
image and checkpoint directories created: runs\rek7_img & runs\rek7_sv
training epochs: range(0, 15)
starting training loop
epoch: 0%| | 0/667 [00:00<?, ?it/s]
extra generator training iteration 0:
genr_loss.backward(retain_graph=True)
extra generator training iteration 1:
E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py:266: UserWarning: Error detected in ConvolutionBackward0. Traceback of forward call that caused the error:
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 248, in <module>
fake_b = genr_b(img_a).to(cfg.device)
... ...
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
torch.autograd.backward(
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 7, 7]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
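For what it’s worth, a common way to end up with this particular RuntimeError is backpropagating through a retained graph after an optimizer step has already updated the network’s weights in place: the saved weight tensor’s version counter no longer matches what autograd recorded. The snippet below is a minimal, self-contained illustration of that pattern, not the project code.

import torch

# minimal illustration of the in-place modification error, not the project code:
# a second backward() through a retained graph after the optimizer has updated
# the weights in place, bumping the saved weight tensor's version counter
lin = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)
x = torch.randn(2, 4, requires_grad=True)  # input requires grad so the weight gets saved

loss = lin(x).sum()
loss.backward(retain_graph=True)  # keep the graph for a second backward pass
opt.step()                        # in-place update of lin.weight
loss.backward()                   # RuntimeError: ... modified by an inplace operation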