Okay, I am thinking I will start a draft post to keep some notes on my refactoring of that test code for using a learning rate scheduler. I may never post it, but it seems worth doing: I am prone to forgetting the whys and whats of my code, and there are few to no comments or class/function/module documentation in general.
Update Global Variables and Command Line Arguments
I will start by setting up a number of variables in the config module. I will also add command line options for most of them. And, I will add new variables for the number of steps the decay should take and at what interval a scheduler step should occur. Those were 25 and 130 respectively in the first post on using a learning rate scheduler.
The initial attempt looks like the following.
... ...
# learning rate scheduler
use_lrs = True # use a learning rate scheduler?
lrs_unit = 'batch' # decay over 'batch'es or 'epoch's
lrs_eps = 5 # how many epochs to decay over
lrs_init = 0.01 # initial learning rate if using learning rate scheduler
# using torch.optim.lr_scheduler.StepLR
lrs_steps = 25 # number of steps to go from init lr to final lr
# these next two will need to be recalculated once cl args obtained
# and we will also need number of batches per epoch if lrs_unit is batch
# gamma value to use with StepLR
lrs_gamma = (g_lr / lrs_init)**(1/lrs_steps)
lrs_interval = 130 # number of iterations before executing lrs step
... ...
parser.add_argument("-lrs", "--use_lrs", action="store_true", required=False, help="Use a learning rate scheduler")
parser.add_argument("-lru", "--lrs_unit", type=str, required=False, help="LR scheduler unit, 'batch' or 'epoch'")
parser.add_argument("-lre", "--lrs_eps", type=int, required=False, help="Number of epochs LR scheduler to decay over")
parser.add_argument("-lri", "--lrs_init", type=float, required=False, help="Initial LR scheduler learning rate value")
parser.add_argument("-lrp", "--lrs_steps", type=float, required=False, help="Total number of steps for LRS decay")
I also modified the values in the if blocks used to check whether or not a LR scheduler step should be taken in the main module, cyc_gan.
... ...
if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:
lrs_dA.step()
... ...
if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:
lrs_genr.step()
New Function
I am going to add a function that uses the values of the LR scheduler variables to generate the gamma and interval values once the command line arguments have been processed. This function will need to be called from the cyc_gan module, as it will need the number of batches per epoch.
def updt_lrs_vars(len_ds):
    global lrs_gamma, lrs_interval
    lrs_gamma = (g_lr / lrs_init)**(1/lrs_steps)
    nbr_iters = int(len_ds / batch_sz)
    if lrs_unit == 'batch':
        lrs_interval = int(lrs_eps * nbr_iters / lrs_steps)
        # round down to a multiple of 10
        lrs_interval = int((lrs_interval // 10) * 10)
    else:
        lrs_interval = nbr_iters
And I added a bit of test code to the module, refactoring the previous tests as well.
if __name__ == "__main__":
... ...
if __name__ == "__main__":
    from datasets import LoadData
... ...
if __name__ == "__main__":
    cl_args = get_cl_args()
    print("before update:\n", cl_args)
    updt_cl_args(cl_args)
    print("after update:\n", cl_args)
    mk_dirs()
    get_data = LoadData(ds_A, ds_B, trfs=trfs)
    updt_lrs_vars(get_data.__len__())
    print("\nlrs calculated variables:", lrs_gamma, lrs_interval)
And running a couple of tests produced the following terminal output.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lru batch
before update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
after update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv
lrs calculated variables: 0.831763771102671 130
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lru epoch
before update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'epoch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
after update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': False, 'lrs_unit': 'epoch', 'lrs_eps': None, 'lrs_init': None, 'lrs_steps': None}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv
lrs calculated variables: 0.831763771102671 667
And that first set of values matches what we calculated in the first post on using a learning rate scheduler.
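As a quick sanity check, the arithmetic behind those two numbers can be reproduced in a few lines. The 667 batches per epoch is implied by the 'epoch' test run above, and the 0.0001 final learning rate (g_lr) is the value mentioned a bit further down.

# re-deriving the values printed by the config module test
g_lr, lrs_init, lrs_steps, lrs_eps = 0.0001, 0.01, 25, 5
nbr_iters = 667    # batches per epoch at a batch size of 2

gamma = (g_lr / lrs_init)**(1/lrs_steps)
print(gamma)       # 0.831763771102671

interval = int(lrs_eps * nbr_iters / lrs_steps)    # 133
interval = int((interval // 10) * 10)              # 130, rounded down to a multiple of 10
print(interval)    # 130 for 'batch', while 'epoch' simply uses nbr_iters (667)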
Already Another Refactor
I was going to move on to refactoring the cyc_gan module. But it crossed my mind that I might want to train the networks at that initial higher learning rate for some number of epochs before beginning to decay the learning rate over some number of steps. So I am going to add another global variable, lrs_wmup, and a command line argument of the same name (wmup == warmup; was going to use wait but…).
Here are the changes with some extra lines for context.
... ...
lrs_eps = 5 # how many epochs to decay over
lrs_wmup = 0 # nbr of epochs to use initial learning rate before beginning decay
lrs_init = 0.01 # initial learning rate if using learning rate scheduler
... ...
parser.add_argument("-lrp", "--lrs_steps", type=float, required=False, help="Total number of steps for LRS decay")
parser.add_argument("-lrw", "--lrs_wmup", type=int, required=False, help="Number of epochs of warmup using intial learning rate")
And a quick test.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python config.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
before update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
after update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv
lrs calculated variables: 0.7356422544596414 80
So, with that set of arguments, training will use a decay multiplier of \(\approx 0.735\) every \(80\) iterations for the third and fourth epochs of training. The first two epochs will use the default initial learning rate (\(0.01\)). And the fifth epoch will use the final learning rate for all its iterations, currently \(0.0001\) for all networks (discriminators and generators).
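To see what those \(15\) decay steps do to the learning rate in practice, here is a small standalone simulation using StepLR and a throwaway optimizer over a dummy parameter. Just enough to drive the scheduler; none of this touches the project code.

import torch

gamma = (0.0001 / 0.01)**(1/15)    # ~0.7356, the same value as lrs_gamma above
opt = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=0.01)
lrs = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=gamma)

for step in range(1, 16):
    opt.step()    # effectively a no-op here, just keeps the optimizer/scheduler step order sane
    lrs.step()
    print(step, round(lrs.get_last_lr()[0], 6))
# 1 0.007356
# ...
# 15 0.0001  <- the final learning rate used for the last epoch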
To Call or Not To Call
I currently use an if statement with three separate conditions and'ed together, plus a couple of extra steps to generate a value needed by the if statement.
g_prv_lr = lrs_genr.get_last_lr()
new_lr = g_prv_lr[0] * lrs_gamma
if iteration % cfg.lrs_interval == 0 and epoch < cfg.lrs_eps and new_lr > cfg.c_lr:
With the decay delay to consider, it is going to get a touch more convoluted. So, I think it best to add a new function to determine whether or not we should call the learning rate scheduler's step method. It is really pretty straightforward, so…
def do_lr_step(epoch, iter, lrs):
    do_step = False
    ep_last = cfg.lrs_wmup + cfg.lrs_eps
    g_prv_lr = lrs.get_last_lr()
    new_lr = g_prv_lr[0] * cfg.lrs_gamma
    if epoch >= cfg.lrs_wmup and epoch < ep_last:
        do_step = iter % cfg.lrs_interval == 0 and new_lr > cfg.c_lr
    return do_step
Test New Function
I added a test to the utils module. This proved a touch more involved than I initially expected. I also had to add some imports to get it to work. I repeated the test from the config module as I wanted that data available if there were problems.
In the following, do recall that epoch is zero-based and that iteration is one-based. Maybe I should make them both the same, but that would involve modifying the currently simple arithmetic tests. So I am not sure there would be any real benefit.
... ...
if __name__ == "__main__":
    from datasets import LoadData
    from models import Discriminator, Generator
... ...
if __name__ == "__main__":
    cl_args = cfg.get_cl_args()
    cfg.updt_cl_args(cl_args)
    print("after update:\n", cl_args)
    cfg.mk_dirs()
    get_data = LoadData(cfg.ds_A, cfg.ds_B, trfs=cfg.trfs)
    cfg.updt_lrs_vars(get_data.__len__())
    print("\nlrs calculated variables:", cfg.lrs_gamma, cfg.lrs_interval)
    # test do_lr_step() function
    # set up model, optimizer, scheduler
    genr_a = Generator(init_feats=cfg.init_features, nbr_rblks=cfg.num_res_blks).to(cfg.device)
    genr_b = Generator(init_feats=cfg.init_features, nbr_rblks=cfg.num_res_blks).to(cfg.device)
    opt_genr = torch.optim.Adam(list(genr_a.parameters()) + list(genr_b.parameters()), lr=cfg.lrs_init, betas=cfg.betas)
    lrs_genr = torch.optim.lr_scheduler.StepLR(opt_genr, step_size=1, gamma=cfg.lrs_gamma)
    epoch = cfg.lrs_wmup
    iter = cfg.lrs_interval * 2
    do_step = do_lr_step(epoch, iter, lrs_genr)
    print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
    iter = cfg.lrs_interval * 2 - 2
    do_step = do_lr_step(epoch, iter, lrs_genr)
    print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
    epoch = cfg.lrs_wmup - 1
    iter = cfg.lrs_interval * 2
    do_step = do_lr_step(epoch, iter, lrs_genr)
    print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
    epoch = cfg.lrs_wmup + cfg.lrs_eps
    do_step = do_lr_step(epoch, iter, lrs_genr)
    print(f"\nepoch: {epoch}, iteration: {iter}, do scheduler step: {do_step}")
And…
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python utils.py -rn rek8 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
after update:
{'run_nm': 'rek8', 'dataset_nm': None, 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': None, 'epochs': 5, 'batch_sz': 2, 'image_sz': None, 'num_res_blks': None, 'x_disc': None, 'x_genr': None, 'x_eps': None, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': None, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek8_img & runs\rek8_sv
lrs calculated variables: 0.7356422544596414 80
epoch: 2, iteration: 160, do scheduler step: True
epoch: 2, iteration: 158, do scheduler step: False
epoch: 1, iteration: 160, do scheduler step: False
epoch: 4, iteration: 160, do scheduler step: False
Continue Refactor of cyc_gan Module
Not sure there is a lot to be done, but definitely a little. If nothing else, we need to call that new function to set the learning rate scheduler values appropriately. Not sure where to put it, but I will stick with the approach I used for command line arguments and put it in the cyc_gan module. And we will need to modify the if blocks determining whether or not to execute a scheduler step to use the new function.
And, as with any refactor, earlier decisions need to be reconsidered and/or corrected. So that block of code used to set some training and/or scheduler parameters will need to be modified to use the new globals.
Here are the updated lines of code, with context lines as appropriate. Most of it is just using the appropriate config module variables and that new function. Because of where I put the call to updt_lrs_vars(), I had to move the data loading code above that particular section so that I had access to the dataset's __len__ method.
... ...
from utils import image_grid, tensor2image, sv_chkpt, ld_chkpt, weights_init, do_lr_step
... ...
cl_args = cfg.get_cl_args()
cfg.updt_cl_args(cl_args)
cfg.print_cl_args(cl_args)
cfg.mk_dirs()
# calculate/set learning rate scheduler values
cfg.updt_lrs_vars(get_data.__len__())
... ...
use_lrs = True
if use_lrs:
    g_lr_init = cfg.lrs_init
    c_lr_init = cfg.lrs_init
else:
    g_lr_init = cfg.g_lr
    c_lr_init = cfg.c_lr
... ...
if use_lrs:
    lrs_genr = torch.optim.lr_scheduler.StepLR(opt_genr, step_size=1, gamma=cfg.lrs_gamma)
    lrs_dA = torch.optim.lr_scheduler.StepLR(opt_dA, step_size=1, gamma=cfg.lrs_gamma)
    lrs_dB = torch.optim.lr_scheduler.StepLR(opt_dB, step_size=1, gamma=cfg.lrs_gamma)
... ...
if do_lr_step(epoch, iteration, lrs_dA):
    lrs_dA.step()
    lrs_dB.step()
... ...
if do_lr_step(epoch, iteration, lrs_genr):
    lrs_genr.step()
Test Run
I now want to do a 5 epoch test run. There will be a two epoch warmup, then two epochs of decay, followed by one epoch at the final learning rate. The learning rate will decay from \(0.01\) to \(0.0001\). The goal is to see if using a warmup helps prevent the training collapse we experienced with the first attempt to use a learning rate scheduler. Five epochs will hardly get the CycleGAN trained.
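Before looking at the run itself, here is a rough sketch of where the learning rate should sit per epoch for this set of arguments. The numbers assume roughly half of the \(15\) decay steps land in each of the two decay epochs, so treat them as ballpark values only.

# intended schedule for -ep 5 -lrw 2 -lre 2 -lrp 15 with lrs_init = 0.01:
#   epochs 0-1: 0.01 (warmup, no scheduler steps)
#   epochs 2-3: decay by gamma ~= 0.7356 every 80 iterations
#   epoch 4   : the final learning rate, nominally 0.0001
gamma = (0.0001 / 0.01)**(1/15)
print(0.01 * gamma**7)     # ~0.00117, roughly the learning rate entering epoch 3
print(0.01 * gamma**15)    # ~0.0001, the learning rate entering epoch 4 if all 15 steps occur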
(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek9 -ep 5 -bs 2 -sc 50 -si 400 -lrs -lru batch -lre 2 -lrw 2 -lrp 15
{'run_nm': 'rek9', 'dataset_nm': 'no_nm', 'sv_img_cyc': 400, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 2, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': True, 'lrs_unit': 'batch', 'lrs_eps': 2, 'lrs_init': 0.01, 'lrs_steps': 15.0, 'lrs_wmup': 2}
image and checkpoint directories created: runs\rek9_img & runs\rek9_sv
Setting torch seed to 73 and initializing model weights
training epochs: range(0, 5)
starting training loop
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:37<00:00, 1.55s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:42<00:00, 1.56s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:41<00:00, 1.56s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:36<00:00, 1.55s/it]
epoch: 100%|█████████████████████████████████████████████████████████████████████████| 334/334 [08:36<00:00, 1.55s/it]
Once again the GPU utilization hung out around \(5\text{-}10\text{%}\), compared to \(45\text{-}60\text{%}\) in earlier sessions, i.e. sessions before I started messing with a learning rate scheduler. As I don't believe this is a good thing, I am going to try to figure out why that has happened. I expect I introduced a bug that needs fixing. Maybe even more than one bug.
Done
So this post is done, and I will get to work debugging my code. In my case, that will be lots of print statements, perhaps wrapped in if blocks for possible future reuse.
Until next time, I hope your CycleGAN training is progressing considerably better than mine has to-date.
Afterword
I was actually thinking about stepping away from this project for a while. Maybe permanently. So I was looking at the next possible project: an autoencoder. Likely a variational autoencoder.
An autoencoder model is made up of a pair of neural networks, typically fully connected feedforward networks: an encoder and a decoder. Sound familiar? In an autoencoder the two networks are separate and can be used separately. The encoder compresses the input data into a lower-dimensional latent space representation, discarding noise along the way. The decoder takes a compressed representation from the latent space and tries to reconstruct the original input data.
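For the curious, a minimal sketch of that encoder/decoder pairing in PyTorch might look like the following. The layer sizes are arbitrary, and a variational autoencoder would add a mean/variance parameterization of the latent space on top of this, but the basic shape of the idea is there.

import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, in_feats=784, latent_dim=32):
        super().__init__()
        # encoder: compress the input down to a small latent representation
        self.encoder = nn.Sequential(
            nn.Linear(in_feats, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # decoder: attempt to reconstruct the original input from that representation
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_feats),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# training would minimize a reconstruction loss, e.g. nn.MSELoss()(model(x), x)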
I now believe that the generator model for our CycleGAN embodied the concepts of an autoencoder to do its job. Finally another if, and or but possibly explained.
Perhaps a Postscript
While playing around trying to sort out my issues with the CycleGAN, I discovered that the GPU was in fact running at \(50\text{-}70\text{%}\) utilization, not the \(5\text{-}10\text{%}\) displayed in Windows Task Manager. This is apparently a known issue, at least for some Windows PCs. But I have no idea why it only started on my system within the past few weeks. I am guessing a Windows update did me in. I have updated the Nvidia drivers. Will see if that fixes anything.
Note: I discovered the above by running nvidia-smi in another terminal window while running a training session, then googling for answers.
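For anyone wanting to do the same, nvidia-smi can refresh itself on a timer, which makes it easy to keep an eye on utilization and memory in a second terminal while a training session runs. The 2 is just a refresh interval in seconds.

nvidia-smi -l 2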