Okay, after that bit of questionable troubleshooting covered in the last post, I am at a bit of a loss as to what I should do next.
First Refactor Attempt
But I have decided to refactor my code so that it does not set a random seed during resumed training. Also, I had originally intended to use custom weight initialization on the critics and generators during the initial training session, and somehow failed to do so. I will add that now, though it will not be applied during resumed training sessions.
I added some earlier code to the utilities module and refactored it slightly to account for the change in the normalization function for this project, i.e. batch to instance.
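import torch

def weights_init(m):
    # initialize one layer based on its class name
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
        if hasattr(m, "bias") and m.bias is not None:
            torch.nn.init.constant_(m.bias.data, 0.0)
    elif classname.find("Norm2d") != -1:
        # InstanceNorm2d has no weight or bias unless created with affine=True
        if getattr(m, "weight", None) is not None:
            torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        if getattr(m, "bias", None) is not None:
            torch.nn.init.constant_(m.bias.data, 0.0)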
Then, after the code initializing the 4 networks, I added the following, moving the seed-setting code from its previous location at the start of the training loop.
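if trn_model and not cfg.resume:
    # set seed for repeatability
    torch.manual_seed(cfg.pt_seed)
    # custom weights for all networks; apply() runs weights_init on every
    # submodule, whereas weights_init(net) would only test the top-level class
    genr_a.apply(weights_init)
    genr_b.apply(weights_init)
    disc_a.apply(weights_init)
    disc_b.apply(weights_init)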
For resumed sessions, I had hard coded the number of epochs of training to \(5\). But I am going to refactor that to use the epochs variable from the global variables module. I have changed the default for that global from \(20\) to \(5\). There is a command line option, "-ep" or "--epochs", that I can use to set this value when executing the GAN’s main module. I had to refactor a couple of different lines of code to get this to happen.
... ...
# for resumed training sessions, this needs to be set at the command line
strt_epoch = cfg.start_ep
# use number of epochs of training specified by appropriate global variable
end_epoch = cfg.start_ep + cfg.epochs
print(f"training epochs: range({strt_epoch}, {end_epoch})")
... ...
for epoch in range(strt_epoch, end_epoch):
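As a rough illustration of how the command line options could feed that epoch range, here is a minimal, self-contained sketch. The short flags `-rs`, `-se` and `-ep` and the long name `--epochs` come from this post; the other long names, the defaults, and the `parse_args` and `epoch_range` helpers are assumptions made for the example, not the project's actual code.

```python
import argparse

def parse_args(argv=None):
    # Short flags match the commands used in this post; long names other
    # than --epochs, and all defaults, are assumptions for illustration.
    ap = argparse.ArgumentParser()
    ap.add_argument("-rs", "--resume", action="store_true")
    ap.add_argument("-se", "--start_ep", type=int, default=0)
    ap.add_argument("-ep", "--epochs", type=int, default=5)
    return ap.parse_args(argv)

def epoch_range(args):
    # mirrors: range(cfg.start_ep, cfg.start_ep + cfg.epochs)
    return range(args.start_ep, args.start_ep + args.epochs)

# a resumed session starting at epoch 5 and training for 5 more epochs
args = parse_args(["-rs", "-se", "5", "-ep", "5"])
print(f"training epochs: {epoch_range(args)}")
```

Because the end of the range is computed from the start, only the start epoch and the number of epochs ever need to be given at the command line.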
I am going to run an initial training session for 5 epochs with a batch size of 2 and see what happens.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -ep 5 -bs 2 -si 300
image and checkpoint directories created: runs\rek4_img & runs\rek4_sv
training epochs: range(0, 5)
starting training loop
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [10:38<00:00, 1.04it/s]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [10:36<00:00, 1.05it/s]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [10:38<00:00, 1.05it/s]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [10:35<00:00, 1.05it/s]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [10:35<00:00, 1.05it/s]
gpu: ~30% utilization, 66-68°C, 6.4/11.0 GB dedicated memory, 0.1/31.9 GB shared memory
I decided to run a few resumed training sessions.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 5 -ep 5 -bs 2 -si 300
... ...
training epochs: range(5, 10)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 10 -ep 5 -bs 2 -si 300
... ...
training epochs: range(10, 15)
... ...
One generator, horse to zebra, seems to be getting trained; the other, not so much. No idea why.
I modified the training code to save loss data more frequently. This extra loss data will be saved in the checkpoints at the end of each training run. Once again I failed to keep a set of checkpoints: the ones from after the first resumed training session. Truly slow-witted some days.
I have decided, over the next day or two, to run resumed training sessions until I get to 40 or 50 total epochs of training. If I am still nowhere good, I will tackle training the discriminator more or less often than the generator. Or perhaps play with the learning rate.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 15 -ep 5 -bs 2 -si 300
... ...
training epochs: range(15, 20)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 20 -ep 5 -bs 2 -si 300
... ...
training epochs: range(20, 25)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 25 -ep 5 -bs 2 -si 300
... ...
training epochs: range(25, 30)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 30 -ep 5 -bs 2 -si 300
... ...
training epochs: range(30, 35)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 35 -ep 5 -bs 2 -si 300
... ...
training epochs: range(35, 40)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 40 -ep 5 -bs 2 -si 300
... ...
training epochs: range(40, 45)
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek4 -rs -se 45 -ep 5 -bs 2 -si 300
... ...
training epochs: range(45, 50)
... ...
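The resumed sessions above all follow the same pattern, with only the start epoch changing. A small sketch like the following could generate the command lines rather than typing each one. The function name `resume_commands` and its parameters are hypothetical; the flags and values are simply those used in the commands in this post, passed through unchanged.

```python
def resume_commands(run_name, total_epochs, chunk=5, bs=2, si=300):
    """Build one initial command plus the resumed commands, chunk epochs each."""
    cmds = []
    for start in range(0, total_epochs, chunk):
        # the last chunk may be shorter if total_epochs is not a multiple
        n_ep = min(chunk, total_epochs - start)
        cmd = f"python cyc_gan.py -rn {run_name}"
        if start > 0:
            # every session after the first is a resumed one
            cmd += f" -rs -se {start}"
        cmd += f" -ep {n_ep} -bs {bs} -si {si}"
        cmds.append(cmd)
    return cmds

for cmd in resume_commands("rek4", 50):
    print(cmd)
```

Each printed line could then be run by hand, or in sequence from a shell script, once the previous session's checkpoints have been written.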
After 50 epochs, over 8 hours of computer execution time, there does appear to be some improvement in the CycleGAN’s generator outputs. But nowhere near usable. And, when scanning over all the images I saved, Generator B seems to do better than Generator A. Though better is definitely a relative term.
Where Next?
I have been saving the total discriminator and the total generator error every 50 iterations in all four checkpoint files at the end of each training session.
Looks like the generator error rate is improving. Not so much the discriminator.
So, I am going to change things and try again for 50 epochs. The first change will be training the discriminator 2 or 3 times more often than the generator for the first few epochs, as we did with the glasses/no glasses conditional GAN. Though there, for the first 3 or 5 epochs, we trained the critic 5 times for each time we trained the generator. As that added considerable time to the execution of each of those early epochs, I am going to try a smaller number of extra discriminator training steps per iteration.
I am also thinking I will record all the different error values every 50 iterations or so, and save those either in the checkpoint files or in a separate file alongside the checkpoint files. With the former, I will have to sort out saving them in only one of the checkpoint files. I am hoping that, if the training continues to be rather slow, one of the different error types might help me sort things out a little better. Though that could end up being the long road to nowhere.
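One way the extra discriminator training for the first few epochs might be structured is sketched below. Everything here is an assumption for illustration: the helper names, the 3 warm-up epochs, and the 2 discriminator steps per generator step are placeholder choices, and `train_disc` and `train_genr` stand in for the real update code.

```python
def disc_steps(epoch, warmup_epochs=3, warmup_steps=2):
    """Discriminator updates per generator update for a given epoch."""
    return warmup_steps if epoch < warmup_epochs else 1

def run_epoch(epoch, n_batches, train_disc, train_genr):
    # train_disc / train_genr are placeholders for the real update steps
    for batch in range(n_batches):
        for _ in range(disc_steps(epoch)):
            train_disc(batch)
        train_genr(batch)

# count the updates in a warm-up epoch and in a later epoch
counts = {"disc": 0, "genr": 0}

def fake_disc(batch):
    counts["disc"] += 1

def fake_genr(batch):
    counts["genr"] += 1

run_epoch(0, 10, fake_disc, fake_genr)   # warm-up: 2 disc steps per batch
run_epoch(5, 10, fake_disc, fake_genr)   # later: 1 disc step per batch
print(counts)
```

Because the schedule depends on the absolute epoch number, a resumed session that starts at a later epoch would skip the warm-up automatically.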
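For the separate-file option, a minimal sketch might look like the following. The `LossLog` class, the JSON format, and the loss names used in the example (`disc_total`, `genr_total`, `cycle`) are all assumptions; the real training loop would pass in whichever error values it computes.

```python
import json
import os
import tempfile
from pathlib import Path

class LossLog:
    """Collect loss values every `every` iterations and append them to a file."""

    def __init__(self, every=50):
        self.every = every
        self.rows = []

    def record(self, epoch, iteration, **losses):
        # keep one row every `every` iterations; values are stored as floats
        if iteration % self.every == 0:
            row = {"epoch": epoch, "iter": iteration}
            row.update({name: float(val) for name, val in losses.items()})
            self.rows.append(row)

    def save(self, path):
        # extend any existing history so resumed sessions add to the same file
        path = Path(path)
        old = json.loads(path.read_text()) if path.exists() else []
        path.write_text(json.dumps(old + self.rows))
        self.rows = []

# example: 120 iterations of made-up values, written to a temporary directory
log = LossLog(every=50)
for it in range(120):
    log.record(0, it, disc_total=0.5, genr_total=2.0, cycle=1.2)
loss_path = os.path.join(tempfile.mkdtemp(), "losses.json")
log.save(loss_path)
```

Keeping the history in its own file avoids having to pick one of the four checkpoint files to carry it.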
This One Done M’thinks
Well, it has been a few days since I wrote the above. I just can’t seem to find the energy to rework my model and attempt to see if that results in more efficient training. I think that is due to the fact that I really don’t know what I am doing. More importantly, I just don’t have the requisite knowledge and experience to make the necessary decisions.
But I have decided to just try something and see where it takes me. There are a fair number of options. I can slowly try them all. Though I don’t have a lot of time, as this draft post is not all that far away from its planned publishing date.
And, with that, I will be making you wait to see where I/we go next. My sincerest apologies!