Okay, I am thinking I will start a draft post to keep some notes on my refactoring of that test code for using a learning rate scheduler. I may never post it, but it seems worth doing: I am prone to forgetting the whys and whats of my code, and there are few to no comments or class/function/module documentation in general.

Update Global Variables and Command Line Arguments

I will start by setting up a number of variables in the config module. I will also add command line options for most of them. And, I will add new variables for the number of steps the decay should take and at what interval a scheduler step should occur. Those were 25 and 130 respectively in the first post on using a learning rate scheduler.

Initial attempt looks like the following.

... ...

# learning rate scheduler
use_lrs = True              # use a learning rate scheduler?
lrs_unit = 'batch'          # decay over 'batch'es or 'epoch's
lrs_eps = 5                 # how many epochs to decay over
lrs_init = 0.01             # initial learning rate if using learning rate scheduler
lrs_steps = 25              # number of steps to go from init lr to final lr, using
                            #   torch.optim.lr_scheduler.StepLR
# these next two will need to be recalculated once cl args obtained,
# and we will also need number of batches per epoch if lrs_unit is 'batch'
lrs_gamma = (g_lr / lrs_init)**(1/lrs_steps)    # gamma value to use with StepLR
lrs_interval = 130          # number of iterations before executing lrs step

... ...

  parser.add_argument("-lrs", "--use_lrs", action="store_true", required=False, help="Use a learning rate scheduler")
  parser.add_argument("-lru", "--lrs_unit", type=str, required=False, help="LR scheduler unit, 'batch' or 'epoch'")
  parser.add_argument("-lre", "--lrs_eps", type=int, required=False, help="Number of epochs LR scheduler to decay over")
  parser.add_argument("-lri", "--lrs_init", type=float, required=False, help="Initial LR scheduler learning rate value")
  parser.add_argument("-lrp", "--lrs_steps", type=float, required=False, help="Total number of steps for LRS decay")

I also modified the values in the if blocks used to check whether or not an LR scheduler step should be taken in the main module, cyc_gan.

... ...

      if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:
        lrs_dA.step()

... ...

      if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:
        lrs_genr.step()

New Function

I am going to add a function that uses the values of the LR scheduler variables to generate the gamma and interval values once the command line arguments have been processed. This function will need to be called from the cyc_gan module as it will need the number of batches per epoch.

def updt_lrs_vars(len_ds):
  # recalculate scheduler gamma and step interval once the command line
  # arguments have been processed and the dataset length is known
  global lrs_gamma, lrs_interval
  lrs_gamma = (g_lr / lrs_init)**(1/lrs_steps)
  nbr_iters = int(len_ds / batch_sz)      # batches per epoch
  if lrs_unit == 'batch':
    lrs_interval = int(lrs_eps * nbr_iters / lrs_steps)
    # round down to the nearest multiple of 10
    lrs_interval = int((lrs_interval // 10) * 10)
  else:
    lrs_interval = nbr_iters

And, I added a bit of a test to the module, refactoring the previous tests as well.

if __name__ == "__main__":
... ...

if __name__ == "__main__":
  from datasets import LoadData

... ...

if __name__ == "__main__":
  cl_args = get_cl_args()
  print("before update:\n", cl_args)
  updt_cl_args(cl_args)
  print("after update:\n", cl_args)
  mk_dirs()

  get_data = LoadData(ds_A, ds_B, trfs=trfs)
  updt_lrs_vars(get_data.__len__())
  print("\nlrs calculated variables:", lrs_gamma, lrs_interval)

And running a couple of tests produced the following terminal output.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lru batch
before update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
after update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv

lrs calculated variables: 0.831763771102671 130

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lru epoch
before update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'epoch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
after update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'epoch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv

lrs calculated variables: 0.831763771102671 667

And that first set of values matches what we calculated in the first post on using a learning rate scheduler.
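
For future reference, here is a quick by-hand check of that arithmetic. This is just a minimal sketch; the 667 iterations per epoch and the defaults (lrs_init = 0.01, g_lr = 0.0001, lrs_steps = 25, lrs_eps = 5) are assumptions taken from the config code and the epoch-unit output above.

# by-hand check of the printed values; the inputs below are assumed from the
# config defaults and the epoch-unit test output (667 iterations per epoch)
g_lr, lrs_init, lrs_steps, lrs_eps = 0.0001, 0.01, 25, 5
nbr_iters = 667

gamma = (g_lr / lrs_init)**(1/lrs_steps)
print(gamma)                                       # ~0.8318, matches the output above

interval = int(lrs_eps * nbr_iters / lrs_steps)    # 133
interval = int((interval // 10) * 10)              # rounded down to 130
print(interval)

# and 25 multiplications by gamma take the lr back down to the final value
print(lrs_init * gamma**lrs_steps)                 # ~0.0001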

Already Another Refactor

I was going to move on to refactoring the cyc_gan module. But it crossed my mind that I might want to train the networks at that initial higher learning rate for some number of epochs before beginning to decay the learning rate over some number of steps. So I am going to add another global variable, lrs_wmup, and a command line argument of the same name (wmup == warmup; was going to use wait but…).

Here are the changes with some extra lines for context.

... ...
lrs_eps = 5                 # how many epochs to decay over
lrs_wmup = 0                # nbr of epochs to use initial learning rate before beginning decay
lrs_init = 0.01             # initial learning rate if using learning rate scheduler

... ...

  parser.add_argument("-lrp", "--lrs_steps", type=float, required=False, help="Total number of steps for LRS decay")
  parser.add_argument("-lrw", "--lrs_wmup", type=int, required=False, help="Number of epochs of warmup using intial learning rate")

And a quick test.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
before update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
after update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv

lrs calculated variables: 0.7356422544596414 80

So, with that set of arguments, training will use a decay multiplier of \(\approx 0.735\) every \(80\) iterations for the third and fourth epochs of training. The first two epochs will use the default initial learning rate (\(0.01\)). And the fifth epoch will use the final learning rate, currently \(0.0001\) for all networks (discriminators and generators), for all of its iterations.
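
For reference, a quick sketch of what those \(15\) decay steps do to the learning rate, using the gamma value printed above. This is just the multiplier arithmetic, not the actual StepLR object.

# plain arithmetic illustration of the decay (gamma taken from the output above)
lr, gamma = 0.01, 0.7356422544596414
for step in range(1, 16):       # 15 decay steps, one every 80 iterations
  lr *= gamma
  print(f"after step {step:2d}: lr ~ {lr:.6f}")
# the final print shows lr back down around 0.0001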

To Call or Not To Call

I currently use an if statement with three separate conditions anded together, plus a couple of extra steps to generate a value needed by the if statement.

      g_prv_lr = lrs_genr.get_last_lr()
      new_lr = g_prv_lr[0] * lrs_gamma
      if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:

With the decay delay to consider, it is going to get a touch more convoluted. So, I think it best to add a new function to determine whether or not we should call the learning rate scheduler’s step method. It is really pretty straightforward, so…

def do_lr_step(epoch, iter, lrs):
  # should the scheduler's step method be called for this epoch and iteration?
  do_step = False
  ep_last = cfg.lrs_wmup + cfg.lrs_eps           # first epoch after the decay window
  g_prv_lr = lrs.get_last_lr()
  new_lr = g_prv_lr[0] * cfg.lrs_gamma           # the lr a step would decay to
  if epoch >= cfg.lrs_wmup and epoch < ep_last:
    do_step = iter % cfg.lrs_interval == 0 and new_lr > cfg.c_lr
  return do_step

Test New Function

I added a test to the utils module. This proved a touch more involved than I initially expected. I also had to add some imports to get it to work. I repeated the test from the config module as I wanted that data available if there were problems.

In the following, do recall that epoch is zero-based and that iteration is one-based. Maybe I should make them both the same, but that would involve modifying the currently simple arithmetic tests. So I am not sure there would be any real benefit.
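
As a small illustration of why the mixed indexing matters, here is which iterations would pass the modulo test under each convention, using the interval of \(80\) from the test run above (a throwaway sketch, not project code).

# which iterations pass `iter % interval == 0` under each convention
interval = 80
one_based  = [i for i in range(1, 161) if i % interval == 0]   # [80, 160]
zero_based = [i for i in range(0, 160) if i % interval == 0]   # [0, 80] -- fires at iteration 0
print(one_based, zero_based)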

... ...
if __name__ == "__main__":
  from datasets import LoadData
  from models import Discriminator, Generator
... ...
if __name__ == "__main__":
  cl_args = cfg.get_cl_args()
  cfg.updt_cl_args(cl_args)
  print("after update:\n", cl_args)
  cfg.mk_dirs()

  get_data = LoadData(cfg.ds_A, cfg.ds_B, trfs=cfg.trfs)
  cfg.updt_lrs_vars(get_data.__len__())
  print("\nlrs calculated variables:", cfg.lrs_gamma, cfg.lrs_interval)

  # test do_lr_step() function
  # set up model, optimizer, scheduler
  genr_a = Generator(init_feats=cfg.init_features, nbr_rblks=cfg.num_res_blks).to(cfg.device)
  genr_b = Generator(init_feats=cfg.init_features, nbr_rblks=cfg.num_res_blks).to(cfg.device)
  opt_genr = torch.optim.Adam(list(genr_a.parameters()) + list(genr_b.parameters()), lr=cfg.lrs_init, betas=cfg.betas)
  lrs_genr = torch.optim.lr_scheduler.StepLR(opt_genr, step_size=1, gamma=cfg.lrs_gamma)

  epoch = cfg.lrs_wmup
  iter = cfg.lrs_interval * 2
  do_step = do_lr_step(epoch, iter, lrs_genr)
  print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
  iter = cfg.lrs_interval * 2 - 2
  do_step = do_lr_step(epoch, iter, lrs_genr)
  print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
  epoch = cfg.lrs_wmup - 1
  iter = cfg.lrs_interval * 2
  do_step = do_lr_step(epoch, iter, lrs_genr)
  print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
  epoch = cfg.lrs_wmup + cfg.lrs_eps
  do_step = do_lr_step(epoch, iter, lrs_genr)
  print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")

And…

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python utils.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
after update:
 {'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv

lrs calculated variables: 0.7356422544596414 80

epoch: 2, iteration: 160, do scheduler step: True

epoch: 2, iteration: 158, do scheduler step: False

epoch: 1, iteration: 160, do scheduler step: False

epoch: 4, iteration: 160, do scheduler step: False

Continue Refactor of cyc_gan Module

Not sure there is a lot to be done, but definitely a little. If nothing else, we need to call that new function to set the learning rate scheduler values appropriately. I wasn’t sure where to put the call, but will stick with the approach I used for command line arguments and put it in the cyc_gan module. And we will need to modify the if blocks that determine whether or not to execute a scheduler step so that they use the new function.

And, as with any refactor, earlier decisions need to be reconsidered and/or corrected. So that block of code used to set some training and/or scheduler parameters will need to be modified to use the new globals.

Here are the updated lines of code, with context lines as appropriate. Most of it is just using the appropriate config module variables and that new function.

Because of where I put the call to updt_lrs_vars(), I had to move the data loading code above that particular section so that I had access to the dataset’s __len__ method (a rough sketch of the resulting order follows the code below).

... ...
from utils import image_grid, tensor2image, sv_chkpt, ld_chkpt, weights_init, do_lr_step
... ...
cl_args = cfg.get_cl_args()
cfg.updt_cl_args(cl_args)
cfg.print_cl_args(cl_args)
cfg.mk_dirs()
# calculate/set learning rate scheduler values
cfg.updt_lrs_vars(get_data.__len__())
... ...
  use_lrs = cfg.use_lrs       # now taken from the -lrs command line flag
  if use_lrs:
    g_lr_init = cfg.lrs_init
    c_lr_init = cfg.lrs_init
  else:
    g_lr_init = cfg.g_lr
    c_lr_init = cfg.c_lr
... ...
  if use_lrs:
    lrs_genr = torch.optim.lr_scheduler.StepLR(opt_genr, step_size=1, gamma=cfg.lrs_gamma)
    lrs_dA = torch.optim.lr_scheduler.StepLR(opt_dA, step_size=1, gamma=cfg.lrs_gamma)
    lrs_dB = torch.optim.lr_scheduler.StepLR(opt_dB, step_size=1, gamma=cfg.lrs_gamma)
... ...
      if do_lr_step(epoch, iteration, lrs_dA):
        lrs_dA.step()
        lrs_dB.step()
... ...
      if do_lr_step(epoch, iteration, lrs_genr):
        lrs_genr.step()
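
As noted above, the data loading now has to happen before the call to updt_lrs_vars(). A rough sketch of the resulting order follows; the dataset and DataLoader lines are assumptions based on the config module test, not a verbatim copy of the cyc_gan module.

# sketch of the new ordering in cyc_gan: command line arguments first, then the
# dataset, so updt_lrs_vars() can be given the dataset length
import config as cfg
from datasets import LoadData
from torch.utils.data import DataLoader

cl_args = cfg.get_cl_args()
cfg.updt_cl_args(cl_args)
cfg.print_cl_args(cl_args)
cfg.mk_dirs()

get_data = LoadData(cfg.ds_A, cfg.ds_B, trfs=cfg.trfs)             # as in the config test
ldr = DataLoader(get_data, batch_size=cfg.batch_sz, shuffle=True)  # assumed DataLoader setup

# now the scheduler gamma and interval values can be calculated
cfg.updt_lrs_vars(get_data.__len__())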

Test Run

I now want to do a 5 epoch test run. There will be a two epoch warmup, then two epochs of decay, followed by one epoch at the final learning rate. The learning rate will decay from \(0.01\) to \(0.0001\). The goal is to see if using a warmup helps prevent that training collapse we experienced with the first attempt to use a learning rate scheduler. Five epochs will hardly get the CycleGAN trained.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek9 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
 {'run_nm': 'rek9', 'dataset_nm': 'no_nm', 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 2, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': 0.01, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek9_img & runs\rek9_sv
Setting torch seed to 73 and initializing model weights
training epochs: range(0, 5)
starting training loop
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:37<00:00,  1.55s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:42<00:00,  1.56s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:41<00:00,  1.56s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:36<00:00,  1.55s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:36<00:00,  1.55s/it]
Losses for Discriminators and Generators

[plot of discriminator losses for 5 epochs of training using a learning rate scheduler]
[plot of generator losses for 5 epochs of training using a learning rate scheduler]

Once again the GPU utilization hung out around \(5\text{-}10\%\) compared to \(45\text{-}60\%\) in earlier sessions, i.e. sessions before I started messing with a learning rate scheduler. As I don’t believe this is a good thing, I am going to try to figure out why that has happened. I expect I introduced a bug that needs fixing. Maybe even more than one bug.

Done

So this post is done. And I will get to work debugging my code. In my case, that will be lots of print statements; perhaps in if blocks for possible future reuse.

Until next time, I hope your CycleGAN training is progressing considerably better than mine has to date.

Afterword

I was actually thinking about stepping away from this project for a while. Maybe permanently. So I was looking at the next possible project: an autoencoder. Likely a variational autoencoder.

An autoencoder model is made up of a pair of fully connected, feedforward neural networks: an encoder and a decoder. Sound familiar? In an autoencoder the two networks are separate and used separately. The encoder compresses the input data to remove noise and generate a latent space. The decoder uses a compressed data representation from the latent space to try and reconstruct the original input data.
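
A minimal sketch of that idea in PyTorch, purely for illustration. It has nothing to do with this project’s code, and the layer and latent sizes are arbitrary.

import torch
from torch import nn

class AutoEncoder(nn.Module):
  # toy fully connected autoencoder: the encoder compresses the input to a
  # small latent vector, the decoder tries to reconstruct the input from it
  def __init__(self, in_feats=784, latent=32):
    super().__init__()
    self.encoder = nn.Sequential(
      nn.Linear(in_feats, 128), nn.ReLU(),
      nn.Linear(128, latent),
    )
    self.decoder = nn.Sequential(
      nn.Linear(latent, 128), nn.ReLU(),
      nn.Linear(128, in_feats), nn.Sigmoid(),
    )

  def forward(self, x):
    z = self.encoder(x)        # latent representation
    return self.decoder(z)     # reconstruction of x

# typical training target: reconstruction loss against the original input
model = AutoEncoder()
x = torch.rand(4, 784)
loss = nn.functional.mse_loss(model(x), x)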

I now believe that the generator model for our CycleGAN embodied the concepts of an autoencoder to do its job. Finally, another if, and, or but possibly explained.

Perhaps a Postscript

While playing around trying to sort out my issues with the CycleGAN, I discovered that the GPU was in fact running at \(50\text{-}70\%\) utilization, not the \(5\text{-}10\%\) displayed in Windows Task Manager. This is apparently a known issue, at least for some Windows PCs. But I have no idea why it only started on my system within the past few weeks. I am guessing a Windows update did me in. I have updated the Nvidia drivers and will see if that fixes anything.

Note, I discovered the above by running nvidia-smi in another terminal window while running a training session, then googling for answers.