Okay, carrying on from last time (coding the generator class, the training loop and such), I did an initial training session of 20 epochs. I was using a batch size of 4. That took over 2 hours to complete.
Too many images were being saved, so I changed cfg.sv_img_cyc from \(50\) to \(150\) (now ~4 images per epoch rather than ~12).
I then ran 5 resumed training sessions of 5 epochs each (the default, for now hardcoded). Required some coding to get it to work. That gave a total of 45 epochs of training. And as nothing seemed to be improving, I decided to run one extended session of five epochs at a batch size of 1. That took ~90 minutes to run.
With batch size of 1, GPU utilization down around 15-30%, memory 4.1/42.9 GB, shared 0.1/31.9 GB, temperature 60-62°C. Definitely easier on the GPU than a batch size of 4. But really increases the run time per epoch (~6 minutes to ~18 minutes). So maybe not really easier on the GPU.
Okay, that got things working pretty well for the zebra to horse cycle. But still no success the other way. Reviewing my code I found the following in the generator section of the training loop.
pred_a_fake = disc_a(fake_a)
pred_b_fake = disc_a(fake_b)
So have modified that to be:
pred_a_fake = disc_a(fake_a)
pred_b_fake = disc_b(fake_b)
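A mix-up like this is easy to miss because nothing crashes. One way to catch it is to check which discriminators actually receive gradients from the generator's adversarial loss. The sketch below is only an illustration: the tiny nn.Conv2d stand-ins, the least-squares (MSE) loss and the helper name discs_used are assumptions, not the project's actual code.

```python
import torch
from torch import nn


def discs_used(disc_a, disc_b, fake_a, fake_b, buggy=False):
    """Report which discriminators get gradients from the generator's adversarial loss."""
    for disc in (disc_a, disc_b):
        disc.zero_grad(set_to_none=True)
    mse = nn.MSELoss()  # assumed least-squares GAN loss
    pred_a_fake = disc_a(fake_a)
    # buggy=True reproduces the original mistake: disc_a judging fake_b
    pred_b_fake = disc_a(fake_b) if buggy else disc_b(fake_b)
    loss = mse(pred_a_fake, torch.ones_like(pred_a_fake)) + mse(
        pred_b_fake, torch.ones_like(pred_b_fake)
    )
    loss.backward()
    return {
        name: any(p.grad is not None for p in disc.parameters())
        for name, disc in (("disc_a", disc_a), ("disc_b", disc_b))
    }


# hypothetical stand-ins for the real discriminators and generator outputs
disc_a, disc_b = nn.Conv2d(3, 1, 4), nn.Conv2d(3, 1, 4)
fake_a, fake_b = torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8)
print(discs_used(disc_a, disc_b, fake_a, fake_b, buggy=True))
print(discs_used(disc_a, disc_b, fake_a, fake_b))
```

With buggy=True only disc_a shows up as used, which means the fake B images were being scored by the discriminator for the other domain.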
Going to start training from scratch using a batch size of 1 and a couple or three epochs and see what happens.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek3 -ep 3 -bs 1 -si 600
image and checkpoint directories created: runs\rek3_img & runs\rek3_sv
training epochs: range(0, 3)
starting training loop
epoch: 100%|███████████████████████████████████████████████████████████████████████| 1334/1334 [18:17<00:00, 1.22it/s]
epoch: 100%|███████████████████████████████████████████████████████████████████████| 1334/1334 [18:15<00:00, 1.22it/s]
epoch: 100%|███████████████████████████████████████████████████████████████████████| 1334/1334 [18:12<00:00, 1.22it/s]
Image modification pretty crappy. Much more training required. Ran 4 more resumed sessions at 5 epochs each.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek3 -rs -se 3 -bs 1 -si 600
...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek3 -rs -se 8 -bs 1 -si 600
...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek3 -rs -se 13 -bs 1 -si 600
...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek3 -rs -se 18 -bs 1 -si 600
...
Doesn’t look like my networks are getting any better when resuming the training sessions. Images from different sessions look pretty much the same.
So what is going on? Or rather, not going on?
DEBUG
Okay, I am going to write a small module to look at some of the contents of my model’s state dictionaries at various stages of training. Hopefully that will help sort things out.
I will start by getting all the available backup directories for the run “rek3”. They are in subdirectories of the main run directory for saving model checkpoints, ./rek3_sv. There are four of them. There should have been five, but I forgot to make a backup for one of the resumed sessions. The last backup is identical to the checkpoint files saved in ./rek3_sv after the latest training session.
I will then load all the checkpoint files for a specific model, say discriminator A, into a list in memory. Then for each saved checkpoint I will print out the loss values and a few of the values for the first convolutional block’s biases and weights. They should hopefully change with each checkpoint.
I am going to show the setup code; some of the imports won’t be used until later in the post.
# ../proj6/find_bugs.py
# Ver 0.1.0: 2024.08.31, rek,
# keep debug code out of other project modules??
from pathlib import Path
import sys
import torch
import config as cfg
from models import Discriminator, Generator
from utils import ld_chkpt
# cycleGAN training not progressing as well as I expected.
# going to try to compare some of the parameters from various
# saved models to see if they are in fact changing
# get command line parameters, update globals, create project sub-directories
cl_args = cfg.get_cl_args()
cfg.updt_cl_args(cl_args)
# need to update dir paths
cfg.mk_dirs()
# get available backup dirs
bk_dirs = [item for item in cfg.sv_dir.iterdir() if item.is_dir()]
bk_pt_fls = []
for bkd in bk_dirs:
    bk_pt_fls.append([item for item in bkd.iterdir() if item.is_file()])
# for bkd in bk_pt_fls:
#     print(bkd)
def print_st_dict_data(st_dicts, key):
    print(f"\nstate dict sample data for {key}")
    for ndx in range(len(st_dicts)):
        print(f"\nbkup chkpt #{ndx + 1}:")
        if "bias" in key:
            print(f"{st_dicts[ndx][key][:10]}")
        else:
            print(f"{st_dicts[ndx][key][1,1,]}")
def print_st_dict_data_2(chk_pts, st_dict, key):
    print(f"\nstate dict sample data for {key}")
    for ndx in range(len(chk_pts)):
        print(f"\nbkup chkpt #{ndx + 1}:")
        if "bias" in key:
            print(f"{chk_pts[ndx][st_dict][key][:10]}")
        else:
            print(f"{chk_pts[ndx][st_dict][key][1,1,]}")
da_chkpts = get_chkpts(bk_pt_fls, 0)
cp_keys = list(da_chkpts[0].keys())
print("\ngetting discriminator a chkpt backups")
print(f"number dicts: {len(da_chkpts)}")
print(f"\n1st dicts keys: {cp_keys}")
d_losses, g_losses = [], []
for ndx in range(len(da_chkpts)):
    d_losses.append(da_chkpts[ndx]["c_loss"])
    g_losses.append(da_chkpts[ndx]["g_loss"])
print(f"\ndiscriminator losses: {d_losses}")
dmsd_keys = list(da_chkpts[0]["model_state_dict"].keys())
print_st_dict_data_2(da_chkpts, "model_state_dict", dmsd_keys[1])
print_st_dict_data_2(da_chkpts, "model_state_dict", dmsd_keys[0])
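The listing above calls get_chkpts, which isn't shown. A minimal sketch of what such a helper might look like follows. It assumes each checkpoint was written with torch.save, and that sorting the file names puts each model's file at the same index in every backup directory. The signature and the device default are guesses, not the project's actual code.

```python
from pathlib import Path

import torch


def get_chkpts(bk_pt_fls, f_ndx, device="cpu"):
    """Load the f_ndx-th checkpoint file from every backup directory."""
    chkpts = []
    for fls in bk_pt_fls:
        # sort so the same index always refers to the same model's file
        fl = sorted(fls)[f_ndx]
        chkpts.append(torch.load(fl, map_location=device))
    return chkpts
```

Path.iterdir() makes no promise about ordering, so the backup directories themselves may also need sorting if the checkpoints are to come back in training order.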
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python find_bugs.py -rn rek3
image and checkpoint directories created: runs\rek3_img & runs\rek3_sv
getting discriminator a chkpt backups
number dicts: 4
1st dicts keys: ['batch_sz', 'epoch', 'c_loss', 'g_loss', 'run_nm', 'model_state_dict', 'optimizer_state_dict']
discriminator losses: [0.4448348581790924, 0.6673791408538818, 0.5696483850479126, 0.4625711441040039]
state dict sample data for cb1.conv_block.0.bias
bkup chkpt #1:
tensor([ 0.0983, 0.1170, 0.0329, 0.0984, 0.1423, 0.0171, -0.0271, -0.0006,
-0.1538, 0.0123], device='cuda:1')
bkup chkpt #2:
tensor([ 0.0991, 0.1029, 0.0232, 0.0799, 0.1575, 0.0182, -0.0240, -0.0130,
-0.2108, -0.0104], device='cuda:1')
bkup chkpt #3:
tensor([ 0.0911, 0.0840, 0.0166, 0.0688, 0.1486, 0.0082, -0.0234, -0.0148,
-0.2401, -0.0165], device='cuda:1')
bkup chkpt #4:
tensor([ 0.0653, 0.0528, -0.0005, 0.0329, 0.1112, 0.0030, -0.0393, -0.0306,
-0.2805, -0.0345], device='cuda:1')
state dict sample data for cb1.conv_block.0.weight
bkup chkpt #1:
tensor([[-0.1351, -0.0496, 0.1311, 0.0951],
[-0.1493, -0.0565, 0.0354, -0.0691],
[-0.1068, 0.0686, -0.0784, 0.0880],
[-0.0303, -0.0653, -0.1430, 0.1246]], device='cuda:1')
bkup chkpt #2:
tensor([[-0.1471, -0.0598, 0.1397, 0.0831],
[-0.1679, -0.0583, 0.0452, -0.0670],
[-0.1103, 0.0641, -0.0629, 0.0881],
[-0.0447, -0.0668, -0.1380, 0.1264]], device='cuda:1')
bkup chkpt #3:
tensor([[-0.1496, -0.0717, 0.1435, 0.0803],
[-0.1714, -0.0602, 0.0464, -0.0652],
[-0.1136, 0.0617, -0.0569, 0.0888],
[-0.0575, -0.0719, -0.1353, 0.1315]], device='cuda:1')
bkup chkpt #4:
tensor([[-0.1518, -0.0810, 0.1509, 0.0761],
[-0.1838, -0.0662, 0.0521, -0.0595],
[-0.1208, 0.0524, -0.0418, 0.0918],
[-0.0672, -0.0741, -0.1235, 0.1349]], device='cuda:1')
Without having any idea what those numbers should be, it does look like they are being modified in each training session, initial or resumed. So, no real help. Let’s look at generator A.
ga_chkpts = get_chkpts(bk_pt_fls, 2)
# cp_keys = list(ga_chkpts[0].keys())
print("\ngetting generator a chkpt backups")
print(f"number dicts: {len(ga_chkpts)}")
# print(f"\n1st dicts keys: {cp_keys}")
print(f"\ngenerator losses: {g_losses}")
gmsd_keys = list(ga_chkpts[0]["model_state_dict"].keys())
print_st_dict_data_2(ga_chkpts, "model_state_dict", gmsd_keys[1])
print_st_dict_data_2(ga_chkpts, "model_state_dict", gmsd_keys[0])
getting generator a chkpt backups
number dicts: 4
generator losses: [4.498274326324463, 3.1302247047424316, 2.9929826259613037, 2.7416255474090576]
state dict sample data for encoder.0.layers.0.bias
bkup chkpt #1:
tensor([-0.0728, 0.0021, 0.0289, -0.0565, 0.0973, 0.0329, -0.0535, -0.0022,
-0.0427, -0.0523], device='cuda:1')
bkup chkpt #2:
tensor([-0.1043, 0.0099, 0.0129, -0.0774, 0.1214, 0.0642, -0.0926, -0.0068,
-0.0591, -0.0717], device='cuda:1')
bkup chkpt #3:
tensor([-0.1245, 0.0124, -0.0018, -0.0942, 0.1425, 0.1177, -0.1156, -0.0049,
-0.0752, -0.0727], device='cuda:1')
bkup chkpt #4:
tensor([-0.1503, 0.0175, -0.0108, -0.1175, 0.1688, 0.2086, -0.1392, 0.0283,
-0.0918, -0.0641], device='cuda:1')
state dict sample data for encoder.0.layers.0.weight
bkup chkpt #1:
tensor([[ 0.0354, 0.0511, -0.0259, 0.0066, 0.0022, -0.0287, 0.0415],
[-0.0598, 0.0118, 0.0031, -0.0534, 0.0210, 0.0511, 0.0315],
[ 0.0837, 0.0238, -0.0035, 0.0750, -0.0161, -0.0269, 0.0596],
[-0.0445, -0.0087, 0.0772, 0.0769, -0.0540, -0.0356, 0.0154],
[-0.0401, 0.0697, -0.0562, 0.0316, -0.0594, 0.0136, 0.0769],
[ 0.0607, -0.0384, -0.0122, 0.0668, -0.0617, -0.0736, -0.0461],
[-0.0216, 0.0559, -0.0482, -0.0399, -0.0317, 0.0748, -0.0074]],
device='cuda:1')
bkup chkpt #2:
tensor([[ 3.8597e-02, 5.2728e-02, -2.7544e-02, 7.8864e-03, 2.9244e-03,
-2.7138e-02, 4.0921e-02],
[-5.8707e-02, 9.8857e-03, -3.9948e-05, -5.4584e-02, 2.1081e-02,
5.0924e-02, 3.2855e-02],
[ 8.5941e-02, 2.5662e-02, -1.1139e-03, 7.6972e-02, -1.6330e-02,
-2.4495e-02, 6.2503e-02],
[-4.4474e-02, -7.7197e-03, 8.0925e-02, 7.6804e-02, -5.6450e-02,
-3.4914e-02, 1.8905e-02],
[-4.0365e-02, 7.0488e-02, -5.4640e-02, 3.0078e-02, -6.0288e-02,
1.7150e-02, 7.9167e-02],
[ 6.0737e-02, -3.9165e-02, -1.0998e-02, 6.5006e-02, -5.8187e-02,
-6.8254e-02, -4.4122e-02],
[-1.8760e-02, 5.9782e-02, -4.3544e-02, -3.7958e-02, -2.3999e-02,
8.3374e-02, -5.4020e-03]], device='cuda:1')
bkup chkpt #3:
tensor([[ 0.0422, 0.0557, -0.0253, 0.0108, 0.0046, -0.0273, 0.0382],
[-0.0552, 0.0127, 0.0016, -0.0530, 0.0225, 0.0517, 0.0331],
[ 0.0896, 0.0279, 0.0021, 0.0808, -0.0148, -0.0213, 0.0646],
[-0.0404, -0.0058, 0.0844, 0.0798, -0.0566, -0.0345, 0.0217],
[-0.0371, 0.0716, -0.0526, 0.0308, -0.0603, 0.0185, 0.0793],
[ 0.0630, -0.0381, -0.0102, 0.0639, -0.0568, -0.0668, -0.0448],
[-0.0168, 0.0626, -0.0413, -0.0372, -0.0211, 0.0852, -0.0064]],
device='cuda:1')
bkup chkpt #4:
tensor([[ 0.0419, 0.0597, -0.0233, 0.0165, 0.0062, -0.0309, 0.0293],
[-0.0562, 0.0109, -0.0002, -0.0535, 0.0236, 0.0470, 0.0272],
[ 0.0873, 0.0289, 0.0007, 0.0845, -0.0161, -0.0172, 0.0624],
[-0.0368, -0.0050, 0.0874, 0.0824, -0.0590, -0.0371, 0.0199],
[-0.0361, 0.0695, -0.0590, 0.0330, -0.0663, 0.0206, 0.0763],
[ 0.0649, -0.0366, -0.0136, 0.0598, -0.0574, -0.0683, -0.0484],
[-0.0214, 0.0631, -0.0447, -0.0363, -0.0222, 0.0879, -0.0142]],
device='cuda:1')
And again, without having any idea what those numbers should be, it does look like they are being modified in each training session, initial or resumed. So, no real help. Let’s look at the discriminator optimizer.
dosd_keys = list(da_chkpts[2]["optimizer_state_dict"].keys())
print(f"\ndiscriminator optimizer keys: {dosd_keys}")
for ndx in range(len(da_chkpts)):
    print(f"\nbkup chkpt #{ndx + 1}:")
    print(f'\n{da_chkpts[ndx]["optimizer_state_dict"]["param_groups"]}')
    print(f'\ntotal steps: {da_chkpts[ndx]["optimizer_state_dict"]["state"][0]["step"]}\n')
discriminator optimizer keys: ['state', 'param_groups']
bkup chkpt #1:
[{'lr': 0.0001, 'betas': (0.5, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False, 'foreach': None, 'capturable': False, 'differentiable': False, 'fused': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
total steps: 4002.0
bkup chkpt #2:
[{'lr': 0.0001, 'betas': (0.5, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False, 'foreach': None, 'capturable': False, 'differentiable': False, 'fused': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
total steps: 10672.0
bkup chkpt #3:
[{'lr': 0.0001, 'betas': (0.5, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False, 'foreach': None, 'capturable': False, 'differentiable': False, 'fused': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
total steps: 17342.0
bkup chkpt #4:
[{'lr': 0.0001, 'betas': (0.5, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False, 'foreach': None, 'capturable': False, 'differentiable': False, 'fused': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}]
total steps: 30682.0
est epochs: 23.0
And that number of estimated epochs, based on the last total steps value, is in fact correct. So, once again, things would appear to be going according to plan. I won’t bother looking at the values in the state dictionary. I don’t expect they will tell me any more than the biases and weights I looked at earlier for the two models/networks.
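The epoch estimate is just the optimizer's step count divided by the iterations per epoch. The arithmetic can be checked as below, using the step values printed above and the 1334 iterations per epoch shown in the progress bars. The variable names are only for illustration.

```python
# optimizer "step" counts read from the four backup checkpoints
total_steps = [4002, 10672, 17342, 30682]
# iterations per epoch at batch size 1, taken from the progress bars
iters_per_epoch = 1334

est_epochs = [steps / iters_per_epoch for steps in total_steps]
print(est_epochs)  # [3.0, 8.0, 13.0, 23.0]
```

The jump from 13 to 23 matches the resumed session whose backup I forgot to make.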
At first I was confused by the list of values for the params key. Those values are the keys for the state dictionary. Which, by the way, is rather large. The list has \(20\) values. Each of those is supposed to be one of the model parameters that the optimizer is tracking. When I looked at the discriminator model I couldn’t find that many potential parameters.
disc_a = Discriminator(3, 64, 1.0, [4, 4, 4, 4, 4], [2, 2, 2, 2, 1], [1, 0, 0, 0 ,1], lr_slp=cfg.c_lr_slp).to(cfg.device)
tot_params = 0
for name, parameter in disc_a.named_parameters():
    if not parameter.requires_grad:
        continue
    params = parameter.numel()
    print(f"{name}, {params}")
    tot_params += params
print(f"total parameters: {tot_params:,}")
cb1.conv_block.0.weight, 3072
cb1.conv_block.0.bias, 64
cb2.conv_block.0.weight, 131072
cb2.conv_block.0.bias, 128
cb3.conv_block.0.weight, 524288
cb3.conv_block.0.bias, 256
cb4.conv_block.0.weight, 2097152
cb4.conv_block.0.bias, 512
cb5.conv_block.0.weight, 8192
cb5.conv_block.0.bias, 1
total parameters: 2,764,737
Wow, 2¾ million model parameters. The parameters the optimizer is talking about (key [param_groups][params]) are the model’s named parameters. But I see \(10\), not \(20\). Then I recalled that the optimizer was tracking both discriminator models/networks, i.e. \(2 \times 10 = 20\).
Well, sadly, I am not sure the above information tells me anything at all. Probably should have saved a lot more loss information for each session.
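For future runs, one low-effort way to keep more loss information would be to append a row to a CSV file every so many iterations. This is only a sketch: the helper name log_losses, the file name and the column names are all hypothetical, and nothing like it exists in the project yet.

```python
import csv
from pathlib import Path


def log_losses(csv_path, epoch, step, losses):
    """Append one row of named loss values, writing a header for a new file."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["epoch", "step", *losses])
        if is_new:
            writer.writeheader()
        # float() also accepts 0-dim tensors, so loss tensors can be passed directly
        row = {"epoch": epoch, "step": step}
        row.update({name: float(val) for name, val in losses.items()})
        writer.writerow(row)
```

In the training loop that could be called as log_losses("runs/rek3_sv/losses.csv", epoch, step, {"d_loss": d_loss, "g_loss": g_loss}), always with the same loss names so the columns stay aligned.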
However, I was curious about the number of parameters in the generator networks. So I modified the code to do just that.
encoder.0.layers.0.weight, 9408
encoder.0.layers.0.bias, 64
encoder.1.layers.0.weight, 73728
encoder.1.layers.0.bias, 128
encoder.2.layers.0.weight, 294912
encoder.2.layers.0.bias, 256
residuals.0.layers.0.layers.0.weight, 589824
residuals.0.layers.0.layers.0.bias, 256
residuals.0.layers.1.layers.0.weight, 589824
residuals.0.layers.1.layers.0.bias, 256
residuals.1.layers.0.layers.0.weight, 589824
residuals.1.layers.0.layers.0.bias, 256
residuals.1.layers.1.layers.0.weight, 589824
residuals.1.layers.1.layers.0.bias, 256
residuals.2.layers.0.layers.0.weight, 589824
residuals.2.layers.0.layers.0.bias, 256
residuals.2.layers.1.layers.0.weight, 589824
residuals.2.layers.1.layers.0.bias, 256
residuals.3.layers.0.layers.0.weight, 589824
residuals.3.layers.0.layers.0.bias, 256
residuals.3.layers.1.layers.0.weight, 589824
residuals.3.layers.1.layers.0.bias, 256
residuals.4.layers.0.layers.0.weight, 589824
residuals.4.layers.0.layers.0.bias, 256
residuals.4.layers.1.layers.0.weight, 589824
residuals.4.layers.1.layers.0.bias, 256
residuals.5.layers.0.layers.0.weight, 589824
residuals.5.layers.0.layers.0.bias, 256
residuals.5.layers.1.layers.0.weight, 589824
residuals.5.layers.1.layers.0.bias, 256
residuals.6.layers.0.layers.0.weight, 589824
residuals.6.layers.0.layers.0.bias, 256
residuals.6.layers.1.layers.0.weight, 589824
residuals.6.layers.1.layers.0.bias, 256
residuals.7.layers.0.layers.0.weight, 589824
residuals.7.layers.0.layers.0.bias, 256
residuals.7.layers.1.layers.0.weight, 589824
residuals.7.layers.1.layers.0.bias, 256
residuals.8.layers.0.layers.0.weight, 589824
residuals.8.layers.0.layers.0.bias, 256
residuals.8.layers.1.layers.0.weight, 589824
residuals.8.layers.1.layers.0.bias, 256
decoder.0.layers.0.weight, 294912
decoder.0.layers.0.bias, 128
decoder.1.layers.0.weight, 73728
decoder.1.layers.0.bias, 64
decoder.2.layers.0.weight, 9408
decoder.2.layers.0.bias, 3
total parameters: 11,378,179
Double wow! Expected there would be a lot more than for the discriminators, but ~11⅓ million was certainly more than I expected. And, given the CycleGAN has two discriminators and two generators…
total for cycleGAN: 28,285,832
And every one of those is updated on every iteration during training. I know, that is nothing like the billions of parameters in the big LLMs out there these days. But I only have one CPU and one GPU. They have tens of thousands at their disposal.
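As a sanity check, the totals above can be reproduced by hand from the layer shapes. The channel widths and kernel sizes below are inferred from the printed parameter counts rather than read from the model code, so treat them as assumptions.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases for one k x k convolution (or transposed convolution)."""
    return c_out * c_in * k * k + c_out


# discriminator: five 4x4 convolutions, widths inferred from the printed counts
d_ch = [3, 64, 128, 256, 512, 1]
disc_total = sum(conv_params(i, o, 4) for i, o in zip(d_ch, d_ch[1:]))

# generator: one 7x7 and two 3x3 encoder convs, nine residual blocks of two
# 3x3 convs each, then a decoder that mirrors the encoder
encoder = conv_params(3, 64, 7) + conv_params(64, 128, 3) + conv_params(128, 256, 3)
residuals = 9 * 2 * conv_params(256, 256, 3)
decoder = conv_params(256, 128, 3) + conv_params(128, 64, 3) + conv_params(64, 3, 7)
gen_total = encoder + residuals + decoder

print(f"{disc_total:,}")  # 2,764,737
print(f"{gen_total:,}")  # 11,378,179
print(f"{2 * (disc_total + gen_total):,}")  # 28,285,832
```

The eighteen residual convolutions account for over 10.6 million of each generator's parameters.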
A Thought or Two
But there are a couple of things that came to mind while I was working on this effort to sort out the model’s failure to learn.
The first is that, when training, I set a random seed to allow reproducibility.
# set seed for repeatability
torch.manual_seed(cfg.pt_seed)
This means that every time I resume training, I expect the images are presented to the networks in much the same order as in the earlier sessions. I don’t expect that is good for training. So, I am not going to set a seed whenever I resume training.
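The effect is easy to demonstrate with a seeded shuffle. The sketch below also shows an alternative to dropping the seed altogether, which is not what I plan to do: derive the seed from the starting epoch, so each session shuffles differently but a given session can still be reproduced. The helper names and the seed value are hypothetical, and whether the real data loader replays its order depends on how it draws its random numbers.

```python
import torch


def epoch_order(seed, n_items):
    """Shuffled index order produced by a generator seeded with `seed`."""
    gen = torch.Generator()
    gen.manual_seed(seed)
    return torch.randperm(n_items, generator=gen).tolist()


def session_seed(base_seed, start_epoch):
    """Hypothetical helper: a distinct but reproducible seed per session."""
    return base_seed + start_epoch


base = 42  # stand-in for cfg.pt_seed
# same seed in two sessions: the shuffle order repeats
same = epoch_order(base, 1334) == epoch_order(base, 1334)
# seed offset by the starting epoch: each session gets its own order
differs = epoch_order(session_seed(base, 3), 1334) != epoch_order(
    session_seed(base, 8), 1334
)
print(same, differs)
```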
Another is that the discriminator(s) or generator(s) is (are) possibly overpowering the opponent(s). This is apparently an issue with GANs in general.
So, if not setting the seed during resumption of training doesn’t help, I will try training the discriminators on every second iteration. If that doesn’t work, I will look at training the generators less frequently.
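Either variation comes down to gating one side's optimizer step on the iteration counter. A minimal sketch of that schedule is below. The names update_plan, d_cyc and g_cyc are hypothetical, not existing config values.

```python
def update_plan(n_iters, d_cyc=2, g_cyc=1):
    """Yield (step, train_disc, train_gen) flags for each iteration."""
    for step in range(n_iters):
        yield step, step % d_cyc == 0, step % g_cyc == 0


# discriminators updated on every second iteration, generators on every one
for step, train_disc, train_gen in update_plan(4):
    print(step, train_disc, train_gen)
```

In the training loop, the discriminator backward pass and optimizer step would sit inside an if train_disc: block. Swapping the two cycle values covers the case of training the generators less often.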
Done
That’s it for this one. A bit of coding, a fair amount of time reading docs and a bit of learning.
Until next time, may your problem solving efforts prove more effective than mine.