Okay, as I suggested in my last post, I am going to try a series of training sessions (initial and resumed) with additional training for the generators to see how that affects the model’s learning progress.
Extra Training for Generators
I will go for an initial session of 15 epochs, with 10 epochs of additional training for the generators. But this time I will only train the generators twice in each iteration, and I will generate images less often. The losses will still be saved every 50 iterations.
BUG!
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek7 -ep 15 -bs 2 -sc 50 -si 400 -xg 2 -xe 10
image and checkpoint directories created: runs\rek7_img & runs\rek7_sv
training epochs: range(0, 15)
starting training loop
epoch: 0%| | 0/667 [00:00<?, ?it/s]E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py:266: UserWarning: Error detected in TanhBackward0. Traceback of forward call that caused the error:
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 195, in <module>
fake_b = genr_b(img_a)
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "F:\learn\mcl_pytorch\proj6\models.py", line 119, in forward
return torch.tanh(x)
(Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\autograd\python_anomaly_mode.cpp:118.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
epoch: 0%| | 0/667 [00:04<?, ?it/s]
Traceback (most recent call last):
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 256, in <module>
genr_loss.backward()
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
torch.autograd.backward(
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Well that didn’t work. No idea why. Okay, it appears to be tied to the tanh used to generate the output for the generators. I jumped in and started messing around. Didn’t take a lot of notes. But I eventually modified the generator model. I moved the torch.tanh() call into the self.decoder attribute. But I had to switch it to the nn.Tanh() module for that to work. And I removed the torch.tanh() call from the model’s forward method.
That portion of the Generator class now looks like the following.
... ...
        # decoding/upsampling, essentially mirror of the encoder, last block does not upsample
        self.decoder = nn.Sequential(
            GConvBlock(init_feats*4, init_feats*2, k_sz=3, s_sz=2, ip_sz=1, upsample=True),
            GConvBlock(init_feats*2, init_feats, k_sz=3, s_sz=2, ip_sz=1, upsample=True),
            GConvBlock(init_feats, 3, k_sz=7, s_sz=1, ip_sz=3, activation=False, normalize=False),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.residuals(x)
        x = self.decoder(x)
        return x
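Before going further, it is worth noting what that first RuntimeError is actually about. The snippet below is a minimal, self-contained illustration rather than the project code: tanh’s backward pass needs the output tensor it saved during the forward pass, and autograd frees those saved tensors after the first call to backward() unless retain_graph=True is specified.

import torch

# minimal illustration of the first error above, not the project code:
# tanh's backward needs its saved forward output, and autograd frees the
# graph's saved tensors after the first backward() unless retain_graph=True
w = torch.randn(4, requires_grad=True)
out = torch.tanh(w)
loss = out.sum()

loss.backward()   # works; the saved tensors (including tanh's output) are freed here
loss.backward()   # RuntimeError: Trying to backward through the graph a second time ...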
I then added retain_graph=True to all the calls to the backward() method in the generator training loop. In fact, I put the call to backward() in an if/else block: the if branch called genr_loss.backward(retain_graph=True) and the else branch called genr_loss.backward(). I didn’t want to maintain the computational graph after the last iteration. Not really sure I want to maintain it after any iteration of extra training, but for now that’s what I am doing, based on the last few lines of the error message above.
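The relevant bit of the extra generator training loop now looks roughly like the following. This is a sketch rather than a verbatim copy of the script: xtra is a hypothetical name for the extra-pass loop index, while x_genr is the configured number of extra generator passes.

# sketch: 'xtra' (hypothetical name) indexes the extra generator passes,
# x_genr is the configured number of extra passes per iteration
if xtra < x_genr - 1:
    # keep the computational graph so the next extra pass can call backward() again
    genr_loss.backward(retain_graph=True)
else:
    # last extra pass: let autograd free the graph's saved tensors as usual
    genr_loss.backward()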
That did not fix the error. So I added a bunch of print statements to try to get a handle on what was going on. Their output is hopefully shown in green below.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek7 -ep 15 -bs 2 -sc 50 -si 400 -xg 2 -xe 10
{'run_nm': 'rek7', 'dataset_nm': 'no_nm', 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 15, 'batch_sz': 2, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 2, 'x_eps': 10}
image and checkpoint directories created: runs\rek7_img & runs\rek7_sv
training epochs: range(0, 15)
starting training loop
epoch: 0%| | 0/667 [00:00<?, ?it/s]
extra generator training iteration 0:
genr_loss.backward(retain_graph=True)
extra generator training iteration 1:
E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py:266: UserWarning: Error detected in ConvolutionBackward0. Traceback of forward call that caused the error:
File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 248, in <module>
fake_b = genr_b(img_a).to(cfg.device)
... ...
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\_tensor.py", line 522, in backward
torch.autograd.backward(
File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\autograd\__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 64, 7, 7]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
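For what it’s worth, a common way to end up with this particular RuntimeError is backpropagating through a retained graph after an optimizer step has already updated the network’s weights in place: the saved weight tensor’s version counter no longer matches what autograd recorded. The snippet below is a minimal, self-contained illustration of that pattern, not the project code.

import torch

# minimal illustration of the in-place modification error, not the project code:
# a second backward() through a retained graph after the optimizer has updated
# the weights in place, bumping the saved weight tensor's version counter
lin = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)
x = torch.randn(2, 4, requires_grad=True)  # input requires grad so the weight gets saved

loss = lin(x).sum()
loss.backward(retain_graph=True)  # keep the graph for a second backward pass
opt.step()                        # in-place update of lin.weight
loss.backward()                   # RuntimeError: ... modified by an inplace operation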