Seems I was a little depressed with the lack of progress with that last attempt. Well, more likely with the lack of progress to date. It has been a number of weeks and I have little or nothing to show for it. I have pretty much ignored the project for the last week or so. But, time to get back at it.

As mentioned last time, I plan to start a new training regime using a learning rate scheduler. I was thinking I’d write a simple function to do the job, with parameters I could specify from the command line. But as I do not run training sessions for lengthy periods of time, I will need to refactor my code to save, and later reload, the information necessary for the scheduler to carry on when I resume a training session.

Now, PyTorch does provide learning rate schedulers, in the torch.optim.lr_scheduler module, so there is no real need to roll my own.
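
For example, torch.optim.lr_scheduler.StepLR multiplies the learning rate of each of an optimizer’s parameter groups by gamma every step_size calls to its step() method. A minimal sketch of the mechanics (the model and optimizer here are throwaway stand-ins, not my project code):

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# multiply the lr by gamma on every call to scheduler.step()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.8317638)

for _ in range(3):
  # forward/backward passes would go here
  optimizer.step()
  scheduler.step()
  print(scheduler.get_last_lr())  # roughly [0.008317638], then [0.00691831...], etc.

Schedulers also provide state_dict() and load_state_dict() methods, which look to be exactly what I will need to save and restore a schedule when resuming a training session.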

I am truly having trouble staying involved with this project. The assumed training failure, and my inability to sort out a solution, has really knocked me down. It has been a week since I wrote the above, and I still haven’t written any code to add, and test, a learning rate scheduler for the model.

Test #1

I am currently planning on running a 5-epoch test using a scheduler. I want to change the learning rate from \(0.01\) to \(0.0001\) over the 5 epochs. In the long run I will make that change over a much larger number of epochs. But that will entail modifying my saved model data to include the last learning rate, and adjusting the scheduler for each extended training session (of, likely, 5 epochs).

I will also, likely in the long run, look at increasing the batch size used for training the model. For now, a small step.

For this test, I will update the learning rate every \(X\) iterations. Most of the posts/tutorials I looked at updated the learning rate once per epoch. But I am only doing a 5-epoch test; in the future, much longer test I will likely spread the change over 25 or more epochs. So, let’s say 25 steps. At 5 epochs of 667 iterations each, that is \(\frac{5 \times 667}{25} = 133.4\), i.e. an update every \(130\) iterations or so.

And, we want the learning rate to go from \(0.01\) to \(0.0001\), a factor of \(0.01\), over those 25 steps. So, we need to multiply the current learning rate by \(0.01^{1/25}\) at each step.
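
Spelling that out: if each of the 25 steps multiplies the learning rate by a constant factor \(\gamma\), then

\[
0.0001 = 0.01 \cdot \gamma^{25} \quad\Rightarrow\quad \gamma = \left(\frac{0.0001}{0.01}\right)^{1/25} = 0.01^{1/25} \approx 0.83176
\]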

So, I threw this code in at the top of one of my modules to work out and verify the multiplier.

# per-step multiplier: 25th root of the total lr ratio (cfg.g_lr = 0.0001 over g_lr_init = 0.01)
g_lr_init = 0.01
expM = round((cfg.g_lr / g_lr_init)**(1 / 25), 7)
print(f"{cfg.g_lr / g_lr_init}**(1 / 25) = {expM}")
# sanity check: applying the multiplier 25 times should recover the full ratio
print(f"{expM}**25 = {expM**25}")
exit(0)

Which gave me the following result.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python find_bugs.test.py
0.01**(1 / 25) = 0.8317638
0.8317638**25 = 0.010000008685561362

The code is very much early development; I mainly just want to get the test working. So I am sacrificing proper global variables, any command line variable changes, proper segregation of the LR scheduler code, etc. I will get there eventually, I hope.

Damn, wasted a bunch of time on a Content-Security-Policy error when trying to access this post using the local server, along with some issues with other sites. Seems to be resolved, but I am not sure why. NoScript? I will look at modifying my Hugo templates and the Netlify server configuration to deal with the issue, for some future proofing. These days, if it ain’t one thing, it seems it is always another.

Back to my badly coded refactor.

... ...
  # Need to sort if statement for the desired conditions when using lr scheduler
  # perhaps some new cli variables
  use_lrs = True
  lrs_gamma = 0.8317638
  if use_lrs:
    g_lr_init = 0.01
    c_lr_init = 0.01
  else:
    g_lr_init = cfg.g_lr
    c_lr_init = cfg.c_lr
  
  # print("instantiating optimizers and losses")
  opt_genr = torch.optim.Adam(list(genr_a.parameters()) + list(genr_b.parameters()), lr=g_lr_init, betas=cfg.betas)
  opt_dA = torch.optim.Adam(disc_a.parameters(), lr=c_lr_init, betas=cfg.betas)
  opt_dB = torch.optim.Adam(disc_b.parameters(), lr=c_lr_init, betas=cfg.betas)

  if cfg.resume:
    # will need to deal with LR schedulers as appropriate
    print(f"\nresuming training at epoch {cfg.start_ep}, loading saved states")
    ld_chkpt(cfg.sv_dir/"generator_a.pt", genr_a, opt_genr)
    ld_chkpt(cfg.sv_dir/"generator_b.pt", genr_b, opt_genr)
    ld_chkpt(cfg.sv_dir/"discriminator_a.pt", disc_a, opt_dA)
    ld_chkpt(cfg.sv_dir/"discriminator_b.pt", disc_b, opt_dB)
  
  # Need to sort if statement for the desired conditions
  # perhaps some new cli variables
  # Under the appropriate conditions, instantiate LR schedulers for each optimizer
  if use_lrs:
    # gamma = 0.01**(1/25) = 0.831763771102671, rounded above to 0.8317638
    lrs_genr = torch.optim.lr_scheduler.StepLR(opt_genr, step_size=1, gamma=lrs_gamma)
    lrs_dA = torch.optim.lr_scheduler.StepLR(opt_dA, step_size=1, gamma=lrs_gamma)
    lrs_dB = torch.optim.lr_scheduler.StepLR(opt_dB, step_size=1, gamma=lrs_gamma)
... ...
        opt_dB.step()

      dA_prv_lr = lrs_dA.get_last_lr()
      dB_prv_lr = lrs_dB.get_last_lr()
      # get_last_lr() returns a list (one lr per parameter group), so index it
      new_lr = dA_prv_lr[0] * lrs_gamma
      if iteration % 130 == 0 and epoch < 5 and new_lr > cfg.c_lr:
        lrs_dA.step()
        lrs_dB.step()
        dA_curr_lr = lrs_dA.get_last_lr()
        dB_curr_lr = lrs_dB.get_last_lr()
... ...
        opt_genr.step()

      # read the generator scheduler's own last lr (not the discriminators')
      g_prv_lr = lrs_genr.get_last_lr()
      new_lr = g_prv_lr[0] * lrs_gamma
      if iteration % 130 == 0 and epoch < 5 and new_lr > cfg.g_lr:
        lrs_genr.step()
        g_curr_lr = lrs_genr.get_last_lr()

        print(f"lr @ {iteration}: dA {dA_prv_lr} -> {dA_curr_lr}, dB {dB_prv_lr} -> {dB_curr_lr}; genr: {g_prv_lr} -> {g_curr_lr}")

And here’s the terminal output for the first and last epochs. The print statements I added to track the learning rate changes mess with the tqdm progress bar output.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400
 {'run_nm': 'rek8', 'dataset_nm': 'no_nm', 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 2, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv
Setting torch seed to 73 and initializing model weights
training epochs: range(0, 5)
starting training loop
epoch:   0%|                                                                                   | 0/667 [00:00<?, ?it/s] generator training iteration 0 of 1:
epoch:  19%|██████████████                                                           | 129/667 [02:36<10:50,  1.21s/it]
lr @ 130: dA [0.01] -> [0.008317638], dB [0.01] -> [0.008317638]; genr: [0.008317638] -> [0.008317638]
epoch:  39%|████████████████████████████▎                                            | 259/667 [05:13<08:13,  1.21s/it]
lr @ 260: dA [0.008317638] -> [0.006918310189904401], dB [0.008317638] -> [0.006918310189904401]; genr: [0.006918310189904401] -> [0.006918310189904401]
epoch:  58%|██████████████████████████████████████████▌                              | 389/667 [07:50<05:36,  1.21s/it]
lr @ 390: dA [0.006918310189904401] -> [0.005754399973133606], dB [0.006918310189904401] -> [0.005754399973133606]; genr: [0.005754399973133606] -> [0.005754399973133606]
epoch:  78%|████████████████████████████████████████████████████████▊                | 519/667 [10:28<03:03,  1.24s/it]
lr @ 520: dA [0.005754399973133606] -> [0.004786301588373507], dB [0.005754399973133606] -> [0.004786301588373507]; genr: [0.004786301588373507] -> [0.004786301588373507]
epoch:  97%|███████████████████████████████████████████████████████████████████████  | 649/667 [13:08<00:22,  1.24s/it]
lr @ 650: dA [0.004786301588373507] -> [0.003981072397091584], dB [0.004786301588373507] -> [0.003981072397091584]; genr: [0.003981072397091584] -> [0.003981072397091584]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [13:30<00:00,  1.22s/it]

... ...

epoch:   9%|██████▊                                                                   | 61/667 [01:13<12:11,  1.21s/it]
lr @ 2730: dA [0.0002511888176880927] -> [0.0002089297655177552], dB [0.0002511888176880927] -> [0.0002089297655177552]; genr: [0.0002089297655177552] -> [0.0002089297655177552]
epoch:  29%|████████████████████▉                                                    | 191/667 [03:51<09:44,  1.23s/it]
lr @ 2860: dA [0.0002089297655177552] -> [0.00017378021570015704], dB [0.0002089297655177552] -> [0.00017378021570015704]; genr: [0.00017378021570015704] -> [0.00017378021570015704]
epoch:  48%|███████████████████████████████████▏                                     | 321/667 [06:28<07:00,  1.22s/it]
lr @ 2990: dA [0.00017378021570015704] -> [0.0001445440925755823], dB [0.00017378021570015704] -> [0.0001445440925755823]; genr: [0.0001445440925755823] -> [0.0001445440925755823]
epoch:  68%|█████████████████████████████████████████████████▎                       | 451/667 [09:05<04:19,  1.20s/it]
lr @ 3120: dA [0.0001445440925755823] -> [0.00012022654370821812], dB [0.0001445440925755823] -> [0.00012022654370821812]; genr: [0.00012022654370821812] -> [0.00012022654370821812]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [13:26<00:00,  1.21s/it]

Strangely enough, GPU utilization was typically under 10%, much lower than in my previous few runs.

Trouble?

The images are pretty poor, so I am not going to show any of them. But worse is the state of the error rates. The discriminators and generator A have an error rate of \(1\), and generator B has an error rate of \(0\). That likely explains the low GPU utilization: essentially nothing is getting updated.

Losses for Discriminators and Generators

[plot of discriminator losses for 5 epochs of training]
[plot of generator losses for 5 epochs of training]

I am at a total loss as to how to proceed from here. Perhaps start with a smaller initial learning rate? Or use fewer steps to get to the final learning rate? Or add extra training for the generators or discriminators? Perhaps give up?

Done for Now

I think it is time for me to once again step back for a bit.

Before doing so, I will tidy all the code I added to the various modules. I will start by moving the new variables into the config module. Then I will look at adding some more command line variables to control learning rate scheduling, and modify the code in any affected modules to use the new variables appropriately.

I have also been thinking of adding a utility function to print the information currently echoed at the start of a training run. That would consolidate it all in one location, if possible, as I think those print statements are currently scattered across the modules.
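
Something along those lines is sketched below; purely hypothetical, assuming the settings live on a cfg namespace like the one echoed at the start of the run above (print_run_info is a name of my own invention):

def print_run_info(cfg, seed=None):
  # consolidate the start-of-run messages currently scattered across the modules
  print({k: v for k, v in vars(cfg).items() if not k.startswith("_")})
  if seed is not None:
    print(f"Setting torch seed to {seed} and initializing model weights")
  print(f"training epochs: range({cfg.start_ep}, {cfg.epochs})")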

I may or may not include those changes in this post.

I will also likely review all my code for the project to make sure I don’t have any other errors I missed before.

Sometimes you just need to give your mind a rest. I am sure the Beast will also appreciate the rest. Until next time…