Looks like another possibly never-ending project (and series of posts). Seems to be a bad habit of mine.

Logging Losses

I think I will define a logger class to handle storing all the loss values and writing them to file separately from the network checkpoints. Not yet sure how to proceed, but that’s never stopped me before. The initialization method will create the lists for storing the values for the individual loss types (a goodly number). There will be at least a method to update the lists and one to write them to file. In fact, after some playing around, I also added methods to load the loss data from a file and to plot the data. For now, I am going to put the class in its own module. I expect there will be a fair bit of trial and error getting it written.

And an error there was. I already know there will be more refactoring, and possibly some new code. The test code in the module just used randomly generated numbers (NumPy random class/functions). When I ran a test epoch to try out the new class and the related refactoring of the cyc_gan module, I eventually got an error (covered below).

import matplotlib.pyplot as plt
from pathlib import Path
import torch

import config as cfg

class Logger():
  def __init__(self, run_nm, freq, loss_nms):
    self.run_nm = run_nm    # run for these losses
    self.epoch = None       # final epoch for these losses
    self.iter = None        # final iter for these losses
    self.freq = freq        # nbr of iterations between logging of loss data
    self.losses = {}        # dictionary for losses, each value will be list of losses for the keyed loss
    for l_nm in loss_nms:
      self.losses[l_nm] = []


  def log_losses(self, losses):
    for l_nm, l_val in losses.items():
      self.losses[l_nm].append(l_val)

  
  def to_file(self, p_dir, epoch, iter=None):
    self.epoch = epoch
    self.iter = iter
    s_iter = f"_{iter}" if iter else ""
    fl_nm = f"losses_{epoch}{s_iter}.pt"
    fl_pth = p_dir/fl_nm
    torch.save({
      'run_nm': self.run_nm,
      'epoch': self.epoch,
      'iter': self.iter,
      'log_freq': self.freq,
      'losses': self.losses
      }, fl_pth)

  
  def from_file(self, p_dir, epoch, iter=None):
    s_iter = f"_{iter}" if iter else ""
    fl_nm = f"losses_{epoch}{s_iter}.pt"
    fl_pth = p_dir/fl_nm
    log_data = torch.load(fl_pth)
    self.run_nm = log_data["run_nm"]
    self.epoch = log_data["epoch"]
    self.iter = log_data["iter"]
    self.freq = log_data["log_freq"]
    self.losses = log_data["losses"]


  def plot_losses(self, epoch=None, iter=None):
    keys = list(self.losses.keys())
    xs = list(range(len(self.losses[keys[0]])))
    fig = plt.figure(figsize=(8, 8))
    for ky in keys:
      plt.plot(xs, self.losses[ky], label=ky)
    plt.legend()
    s_iter = f" / {self.iter}" if self.iter else ""
    plt.title(f"Losses for {self.run_nm} / {self.epoch}{s_iter}")
    plt.show()


if __name__ == "__main__":
  cl_args = cfg.get_cl_args()
  cfg.updt_cl_args(cl_args)
  cfg.mk_dirs()
  loss_nms = ["d_a", "d_b", "disc", "g_a", "g_b", "cyc_a", "cyc_b", "id_a", "id_b", "genr"]

  loss_log = Logger(cfg.run_nm, 50, loss_nms)
  for i in range(10):
    losses = {lnm: cfg.rng.integers(0, 4) + cfg.rng.random() for lnm in loss_nms}
    loss_log.log_losses(losses)
  log_1 = loss_log.losses

  loss_log.to_file(cfg.sv_dir, 0, 333)
  loss_log = None

  loss_log = Logger(cfg.run_nm, 50, loss_nms)
  loss_log.from_file(cfg.sv_dir, 0, 333)
  print(f"log_1 == loss_log.losses: {log_1 == loss_log.losses}")
  loss_log.plot_losses()

And, yes, log_1 and loss_log.losses were equal. And the plot was generated.

Refactor Logging Code in cyc_gan.py

I removed all the old code that periodically saved the total discriminator and generator losses for inclusion in the checkpoint files. I then added code to instantiate the logger, to periodically save the current loss values, and eventually to save the data to a file and plot the losses. I will let you figure out where the following bits of code were put in the module.

... ...
if trn_model:

  # set up logger for losses
  loss_nms = ["d_a", "d_b", "disc", "g_a", "g_b", "cyc_a", "cyc_b", "id_a", "id_b", "genr"]
  lgr_loss = Logger(cfg.run_nm, cfg.sv_chk_cyc, loss_nms)
... ...
      if iteration % cfg.sv_chk_cyc == 0:
        all_losses = {"d_a" : da_loss, "d_b": db_loss,
                      "disc": disc_loss,
                      "g_a": ga_loss, "g_b": gb_loss,
                      "cyc_a": ga_cyc_loss, "cyc_b": gb_cyc_loss,
                      "id_a": ga_id_loss, "id_b": gb_id_loss,
                      "genr": genr_loss}
        lgr_loss.log_losses(all_losses)
... ...
  # training epochs complete, save last batch of losses
  all_losses = {"d_a" : da_loss, "d_b": db_loss,
                "disc": disc_loss,
                "g_a": ga_loss, "g_b": gb_loss,
                "cyc_a": ga_cyc_loss, "cyc_b": gb_cyc_loss,
                "id_a": ga_id_loss, "id_b": gb_id_loss,
                "genr": genr_loss}
  lgr_loss.log_losses(all_losses)
  lgr_loss.to_file(cfg.sv_dir, epoch, iteration)
  lgr_loss.plot_losses()
... ...

I also had to refactor the sv_chkpt() function so that it no longer expects or saves the loss lists as it did previously.
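I won’t show the refactored function here, but a minimal sketch of the idea, with hypothetical parameter names (the real sv_chkpt() signature is different), might look something like this:

# hypothetical sketch only: the real sv_chkpt() takes different arguments,
# but the point is that the checkpoint dict no longer carries any loss lists
def sv_chkpt(p_dir, epoch, nets, optims):
  torch.save({
    "epoch": epoch,
    "nets": {nm: net.state_dict() for nm, net in nets.items()},
    "optims": {nm: opt.state_dict() for nm, opt in optims.items()},
    }, p_dir/f"chkpt_{epoch}.pt")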

Test and Bug

I decided to run training for a single epoch to see if the changes to use the logger class worked. As mentioned above, it did have a problem, specifically when trying to plot the losses.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek5 -ep 1 -bs 2 -sc 50 -si 300 -xd 3 -xe 3
image and checkpoint directories created: runs\rek5_img & runs\rek5_sv
training epochs: range(0, 1)
starting training loop
epoch 1: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [11:32<00:00,  1.04s/it]
Traceback (most recent call last):
  File "F:\learn\mcl_pytorch\proj6\cyc_gan.py", line 282, in <module>
    lgr_loss.plot_losses()
... ...
TypeError: can't convert cuda:1 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Basically, I was appending the loss tensors, still on the GPU, directly to my dictionary of lists. I was able to save that dictionary to file. But when I tried to plot the loss data I got the above error. So, in the appropriate locations in cyc_gan.py, I changed the assignments used to build the dictionary passed to the logger from all_losses = {"d_a" : da_loss, "d_b": db_loss,... to all_losses = {"d_a" : da_loss.item(), "d_b": db_loss.item(),.... The loss tensors were all single-element tensors, so I decided to just extract that value.
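With that change, the periodic logging block looks roughly like this:

      if iteration % cfg.sv_chk_cyc == 0:
        # log plain Python floats rather than CUDA tensors so the values can
        # be saved, reloaded and plotted later without any .cpu() conversion
        all_losses = {"d_a" : da_loss.item(), "d_b": db_loss.item(),
                      "disc": disc_loss.item(),
                      "g_a": ga_loss.item(), "g_b": gb_loss.item(),
                      "cyc_a": ga_cyc_loss.item(), "cyc_b": gb_cyc_loss.item(),
                      "id_a": ga_id_loss.item(), "id_b": gb_id_loss.item(),
                      "genr": genr_loss.item()}
        lgr_loss.log_losses(all_losses)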

Refactor #2

Extra Network Training per Iteration

Okay, I am going to modify the training loop to allow for additional training of either the discriminators or the generators, though I will initially start with additional training for the discriminators. That will involve a few new global variables. And, as discussed above, I plan to save more loss data for each of the networks. I expect the refactoring and testing is going to take me a day or three.

Added the following global variables to config.py.

... ...
x_disc = 2                  # how often the discriminator is trained for each time the generator is trained
x_genr = 1                  # how often the generator is trained for each time the discriminator is trained
x_eps = 5                   # how many epochs of extra training for the networks
... ...

Then, in the main project module, I added the following in the appropriate locations. Some prior code is included for context. The original network training code was indented as appropriate.

... ...
  for epoch in range(strt_epoch, end_epoch):
    # done with extra training of networks?
    if epoch >= cfg.x_eps:
      cfg.x_disc = 1
      cfg.x_genr = 1
... ...
      # Discriminators
      # perhaps train discriminator more often during each iteration
      for _ in range(cfg.x_disc):
        pred_a_real = disc_a(img_a.detach())
... ...
      # Generators
      # perhaps train generator more often during each iteration
      for _ in range(cfg.x_genr):
        pred_a_fake = disc_a(fake_a)
... ...

I did see a post where the author trained the two discriminators separately. I did think about doing that, as it seems one conversion does better than the reverse. But for now, I will just make the single change and train the discriminators more often.

Training Run #1

Okay, let’s train for 10 epochs with a batch size of 2. The discriminators will be trained 3 times per iteration for the first 3 epochs.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek5 -ep 10 -bs 2 -sc 50 -si 300 -xd 3 -xe 3
image and checkpoint directories created: runs\rek5_img & runs\rek5_sv
training epochs: range(0, 10)
starting training loop
epoch 1: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [11:28<00:00,  1.03s/it]
... ...
epoch 3: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [11:26<00:00,  1.03s/it]
epoch 4: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [09:56<00:00,  1.12it/s]
... ...
epoch 10: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [09:56<00:00,  1.12it/s]

There was no obvious improvement in the results of this training run compared with the equivalent results from the previous attempt. I probably need to do more extra training for the discriminators. Maybe even optimize the discriminators separately.

Resume Training for Additional Epochs

Tried to do additional discriminator training on resumption of training.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek5 -rs -se 10 -ep 5 -bs 2 -sc 50 -si 300 -xd 3 -xe 2

That failed. The control statement for the applicable if block was not constructed to work that way. I will see if I can find a more forgiving construct.

Well, it was pretty simple. I changed if epoch >= cfg.x_eps: to if epoch >= cfg.start_ep + cfg.x_eps: (see the updated guard below). And bingo! So I ran two sets of resumed training, with extra discriminator training per iteration for the first 2 epochs of each.
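For context, the guard at the top of the epoch loop now reads:

  for epoch in range(strt_epoch, end_epoch):
    # done with extra training of networks?
    # measured from the run's starting epoch so a resumed run
    # also gets its allotment of extra training
    if epoch >= cfg.start_ep + cfg.x_eps:
      cfg.x_disc = 1
      cfg.x_genr = 1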

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek5 -rs -se 10 -ep 5 -bs 2 -sc 50 -si 300 -xd 3 -xe 2
... ...
training epochs: range(10, 15)
... ...

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek5 -rs -se 15 -ep 5 -bs 2 -sc 50 -si 300 -xd 3 -xe 2
... ...
training epochs: range(15, 20)
... ...

And, needless to say, the images generated in the extra training also do not seem any better than those generated at the corresponding point in the previous attempt.

Sadly, no improvement in training the CycleGAN over any previous attempts.

Refactor #3

Modified cyc_gan.py to train both discriminators separately, each with its own optimizer. I hope I have it right, as I am going to run a lengthy training session with a number of epochs of extra training for the discriminators.

I didn’t quite have it right, but it was a quick fix: remove “disc” from the list of loss names/ids used for the losses dictionary keys, as there is no longer a single combined discriminator loss being logged.
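I won’t reproduce the whole refactor, but a minimal sketch of the idea looks something like the following. The optimizer names (opt_disc_a, opt_disc_b), the mse adversarial criterion (e.g. an nn.MSELoss() instance) and the Adam hyperparameters are illustrative assumptions, not the module’s actual code.

... ...
# hypothetical sketch: one optimizer per discriminator, stepped separately
opt_disc_a = torch.optim.Adam(disc_a.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_disc_b = torch.optim.Adam(disc_b.parameters(), lr=2e-4, betas=(0.5, 0.999))
... ...
      for _ in range(cfg.x_disc):
        # discriminator A: real A images vs generated (fake) A images
        pred_a_real = disc_a(img_a)
        pred_a_fake = disc_a(fake_a.detach())
        da_loss = 0.5 * (mse(pred_a_real, torch.ones_like(pred_a_real))
                         + mse(pred_a_fake, torch.zeros_like(pred_a_fake)))
        opt_disc_a.zero_grad()
        da_loss.backward()
        opt_disc_a.step()

        # discriminator B: real B images vs generated (fake) B images
        pred_b_real = disc_b(img_b)
        pred_b_fake = disc_b(fake_b.detach())
        db_loss = 0.5 * (mse(pred_b_real, torch.ones_like(pred_b_real))
                         + mse(pred_b_fake, torch.zeros_like(pred_b_fake)))
        opt_disc_b.zero_grad()
        db_loss.backward()
        opt_disc_b.step()
... ...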

So an initial 15 epochs of training, with extra training (3 to 1) for the discriminators for the first 7 epochs.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek6 -ep 15 -bs 2 -sc 50 -si 300 -xd 3 -xe 7
image and checkpoint directories created: runs\rek6_img & runs\rek6_sv
training epochs: range(0, 15)
starting training loop
epoch 1: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [11:27<00:00,  1.03s/it]
... ...
epoch 7: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [11:24<00:00,  1.03s/it]
epoch 8: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [09:55<00:00,  1.12it/s]
... ...
epoch 15: 100%|█████████████████████████████████████████████████████████████████████████| 667/667 [09:56<00:00,  1.12it/s]

That took about 80 minutes for the first 7 epochs and another ~80 for the remaining 8 epochs. Again, no real visible change in the quality of the generated images. So, I ran an additional 2 sessions of resumed training, 10 epochs each. On my setup, we’re talking approximately 100 minutes for each session.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek6 -rs -se 15 -ep 10 -bs 2 -sc 50 -si 300
... ...
training epochs: range(15, 25)
... ...

(mclp-3.12) PS F:\learn\mcl_pytorch\proj6> python cyc_gan.py -rn rek6 -rs -se 25 -ep 10 -bs 2 -sc 50 -si 300
... ...
training epochs: range(25, 35)

And, as I imagine you expect, no clear improvement over any previous attempt. Every so often I get an image or two that make me think the model is getting trained. Then it produces a few that definitely aren’t working or improving.

Sample Images

Here are a couple that made me think the model was slowly getting trained. There were one or two others that were similar. But likely 99% of them were, in my opinion, showing little or no improvement in the training of the model.

Generator A: sample image after 30+ epochs of training
Generator A: conversion attempt after 30+ epochs
Generator B: sample image after 30+ epochs of training
Generator B: conversion attempt after 30+ epochs

Losses

The plot with all the losses on a single figure didn’t work very well. Differences in the ranges of the values left a lot of the lines on top of each other, making it impossible to sort one loss from another. So I refactored the Logger class’s plot_losses() method to take a couple of extra optional parameters. to_plot=None takes a list of loss keys to include in the plot; if present, only the losses matching the keys in the list will be plotted. ttl=None takes a string that will be prepended to the default plot title. So, if I wanted to plot just the 2 discriminator losses, the call would look something like the following: loss_log.plot_losses(to_plot=["d_a", "d_b"], ttl="Discriminator"). Your variable names and keys may be different.

I then added a new method, get_all_losses(self). It goes through the current run directory and opens all the saved losses checkpoints; in my case they are named something like losses_14_10005.pt. It creates the object’s losses member from the losses in the first checkpoint loaded, then adds the losses from the remaining checkpoints (e.g. those saved during resumed training sessions) to the list for each loss type. The complete loss values can then be used to plot the overall change in loss for all the epochs in the current run.

  def plot_losses(self, epoch=None, iter=None, to_plot=None, ttl=None):
    keys = to_plot if to_plot else list(self.losses.keys())  
    xs = list(range(len(self.losses[keys[0]])))
    fig = plt.figure(figsize=(8, 8))
    for ky in keys:
      plt.plot(xs, self.losses[ky], label=ky)
    plt.legend()
    s_iter = f" / {self.iter}" if self.iter else ""
    plt.title(f"{ttl if ttl else ""} Losses for {self.run_nm} / {self.epoch}{s_iter}")
    plt.show()


  def get_all_losses(self):
    # collect the loss data from every saved losses checkpoint for the run
    loss_fls = [item for item in cfg.sv_dir.iterdir() if item.is_file() and "losses_" in item.as_posix()]
    # make sure the files are processed in epoch/iteration order
    loss_fls.sort(key=lambda fl: [int(nbr) for nbr in fl.stem.split("_")[1:]])
    log_data = torch.load(loss_fls[0])
    self.freq = log_data["log_freq"]
    self.losses = log_data["losses"]
    self.iter = log_data["iter"]
    for l_fl in loss_fls[1:]:
      log_data = torch.load(l_fl)
      for l_nm, l_vals in log_data["losses"].items():
        self.losses[l_nm].extend(l_vals)
      # accumulate the final iteration count once per file, not per loss type
      self.iter += log_data["iter"]
    self.epoch = log_data["epoch"]

I wrote some code to generate four separate plots for that last training run of 15 initial epochs plus 2 resumed sessions of 10 epochs each. The discriminators were trained 3 times in each iteration for the first 7 epochs; the generators were always trained once per iteration.

    loss_nms = ["d_a", "d_b", "g_a", "g_b", "cyc_a", "cyc_b", "id_a", "id_b", "genr"]
    loss_log = Logger(cfg.run_nm, 50, loss_nms)
    loss_log.get_all_losses()
    loss_log.plot_losses(to_plot=["d_a", "d_b"], ttl="Discriminator")
    loss_log.plot_losses(to_plot=["g_a", "g_b"], ttl="Generator")
    loss_log.plot_losses(to_plot=["cyc_a", "cyc_b"], ttl="Cycle")
    loss_log.plot_losses(to_plot=["id_a", "id_b"], ttl="Identity")

And here are those four plots. The discriminator and generator losses in the first two plots are just the adversarial losses we’ve seen in previous GAN implementations. Note: the \(x\) values have no real meaning; they just run from \(0\) up to the number of values in each loss list, in this case \(range(469)\). That count is roughly what you’d expect from logging every 50 iterations over 35 epochs of 667 iterations, plus the end-of-session logs.

Discriminator losses for 35 total epochs of training
Generator losses for 35 total epochs of training
Cycle losses for 35 total epochs of training
Identity losses for 35 total epochs of training

What I find interesting is that neither the discriminators nor the generators appear to be learning/improving in any meaningful way. Also, the discriminators have lower loss values during the epochs of extra training, then jump up for the later epochs. And, as one would expect, the generators do the reverse. Though the generator values do seem to be a little higher than the discriminator values.
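As an aside, if a more meaningful x-axis is wanted, the logging frequency the Logger already carries could be used to label the axis in approximate training iterations. A small, optional tweak to plot_losses(), assuming self.freq holds the logging interval:

    # scale the x-axis by the logging frequency so it reads as an
    # approximate iteration count rather than a bare sample index
    xs = [i * self.freq for i in range(len(self.losses[keys[0]]))]
    plt.xlabel("training iteration (approx.)")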

Done! Done! Done!

A lengthy post without much to show for the effort involved. Lots of hours of training. The beast was certainly earning its nickname.

Not sure where to go next. More discriminator training per iteration, perhaps for more epochs. Or maybe see what happens if the generators get extra training over the discriminators. Or start messing with the learning rate, something I did see done in a couple of posts/tutorials.

That said, I am leaning toward seeing what happens with more generator training per iteration. Would provide a comparison that might shed some light on my model problem(s).

Until then, I hope your model training is progressing much better than mine.