Okay, let’s get on with additional training. Or, in my case, resumed training.

In the past I just loaded the current model and optimizer states from file before continuing another training session. I also just generated a new log file. When producing the final error plot, I used a module method to load the individual files and merge the error data for plotting.

This time I am going to load the current log file into my logger instance when resuming training and just add the new error data to it. At least that’s my goal/hope. However, I may need to change my convention for naming the file(s). Currently the log file name includes the number of the last epoch (which I am counting base one rather than zero) and the number of iterations completed during the session. The latter, I think, will be a pain to deal with, as might the former. Additionally, using different indexing bases for the code and the file name is perhaps foolhardy.
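To make that concrete, here is a minimal sketch of the logger behaviour I’m after. The from_file() and to_file() method names match the calls in the code further down, but the internals, and the losses_<epoch>.json file name, are illustrative assumptions rather than the actual module code.

# minimal sketch of a resume-friendly loss logger; the method names match my
# calls below, everything else (including the file name) is an assumption
import json
from pathlib import Path

class LossLogger:
  def __init__(self):
    self.losses = []

  def log(self, loss):
    self.losses.append(loss)

  def from_file(self, sv_dir, start_ep):
    # load the error data saved at the end of the previous session, so new
    # values simply get appended to the existing history
    fl = Path(sv_dir) / f"losses_{start_ep}.json"   # hypothetical file name
    with open(fl) as fp:
      self.losses = json.load(fp)

  def to_file(self, sv_dir, fini_ep, iter=None):
    # name the file on the final epoch only; the iteration count (iter) is
    # no longer used in the file name
    fl = Path(sv_dir) / f"losses_{fini_ep}.json"    # hypothetical file name
    with open(fl, "w") as fp:
      json.dump(self.losses, fp)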

Resume Training

We have 5 epochs of training. So let’s do another 5.

I decided to drop the number of iterations from the logger file name when saving to disk. So, I renamed that file in the directory where I save the individual training session data.

I then made a few minor changes to nlp.py. And eventually to ld_chkpt() in the utilities module. Here they are:

# nlp.py
... ...
  if cfg.resume:
    print(f"\nresuming training at epoch {cfg.start_ep}, loading saved states")
    utl.ld_chkpt(cfg.sv_dir/f"lstm_{cfg.start_ep}.pt", lstm, optr)
    lgr_loss.from_file(cfg.sv_dir, cfg.start_ep)
... ...
    # I had 0 instead of ep, the loop variable
    e_losses = do_epoch(ep, lstm, tk_ldr, optr, f_loss, cfg.batch_sz, 5, d_tst=d_tst, verbose=tst_model)
... ...
  if d_tst:
    iteration = d_tst
  else:
    iteration = None    # no longer using this value for logger file name
... ...
  lgr_loss.to_file(cfg.sv_dir, fini_ep, iter=iteration)
... ...
  max_wds = 18    # wanted a few more tokens when I ran the module evaluation test

# utils.py
... ...
  # don't recall why I was doing this, but it's causing problems now
  # if cfg.start_ep < chkpt["epoch"] + 1:
  #   cfg.start_ep = chkpt["epoch"] + 1
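For context, ld_chkpt() does something along these lines. This is a sketch assuming the checkpoint was saved as a dict with "model", "optimizer" and "epoch" entries (the commented-out lines reference chkpt["epoch"]); the actual function in the utilities module may differ in detail.

# rough sketch of ld_chkpt(); the dict keys are assumptions, not the exact code
import torch

def ld_chkpt(fl_pth, model, optr):
  print(f"\tloading {fl_pth}")
  chkpt = torch.load(fl_pth)
  model.load_state_dict(chkpt["model"])
  optr.load_state_dict(chkpt["optimizer"])
  # the cfg.start_ep adjustment based on chkpt["epoch"] used to live here,
  # and is what I have now commented out
  return chkpt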

And let’s do those next 5 epochs of training.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -rs -bs 32 -se 5 -ep 5
 {'run_nm': 'rk1', 'batch_sz': 32, 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': True, 'start_ep': 5, 'epochs': 5, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
took 3.79917439998826 to generate pairs

resuming training at epoch 5, loading saved states
        loading runs\rk1_sv\lstm_5.pt
running epochs in range(5, 10)
epoch  6: 100%|███████████████████████████████████████████████████████████████████| 13418/13418 [08:28<00:00, 26.37it/s]
epoch  7: 100%|███████████████████████████████████████████████████████████████████| 13418/13418 [08:28<00:00, 26.39it/s]
epoch  8: 100%|███████████████████████████████████████████████████████████████████| 13418/13418 [08:27<00:00, 26.43it/s]
epoch  9: 100%|███████████████████████████████████████████████████████████████████| 13418/13418 [08:28<00:00, 26.38it/s]
epoch 10: 100%|██████████████████████████████████████████████████████████████████| 13418/13418 [08:29<00:00, 26.35it/s]

The GPU ran at around 80-85% utilization, at a temperature of around 62-68°C, with power draw at 260W.

Generate Sequence of Tokens

Decided I wanted to see if there was any obvious improvement. And, of course, none!

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 10
... ...
        loading runs\rk1_sv\lstm_10.pt
initial input: ['the', 'prince', 'left']
generated text: the prince left stepan arkadyevitch . “and what do you mean to see me ? ” “of course

Really need to write a function to clean up those output strings.

Function to Tidy Generated Output

Before we get carried away, let’s write that function to fix the grammatically incorrect spaces added to the output by my code. Took a bit of fooling around to find all the probable issues. I generated some extra outputs of a longer length to help with the process.

def tidy_output(txt):
  # fix punctuation issues, remove leading blank
  for tk in ",.;:?!$()/_&%*@'`":
    txt = txt.replace(f" {tk}", f"{tk} ")
    # get rid of any double blanks created by above
    txt = txt.replace(f"{tk}  ", f"{tk} ")
  txt = txt.replace(' "', '" ')
  # then fix leading or trailing blanks for quotes and apostrophes
  txt = txt.replace("'  ", "'")
  txt = txt.replace("' ", "'")
  txt = txt.replace(" ’", "’")
  txt = txt.replace("’ ", "’")
  txt = txt.replace('"  ', '"')
  txt = txt.replace("“  ", "“")
  txt = txt.replace("“ ", "“")
  txt = txt.replace("  ”", "”")
  txt = txt.replace(" ”", "”")
  return txt

Not very pretty, and I’m not sure whether there is a better way to do this. I modified my code to use the new function.

  print(f"generated text: {tidy_output(fn_txt)}")

More Training

Going to run a 10-epoch batch of resumed training before calling it a day.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -rs -bs 32 -se 10 -ep 10
... ...
resuming training at epoch 10, loading saved states
        loading runs\rk1_sv\lstm_10.pt
running epochs in range(10, 20)
... ...
epoch 20: 100%|██████████████████████████████████████████████████████████████████| 13418/13418 [08:27<00:00, 26.42it/s]

Let’s see a sample of generated text.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 20
... ...
initial input: ['the', 'prince', 'left']
generated text: the prince left himself, while he would come to get only out of the doings people,

Another 10 Epochs of Resumed Training ― Twice

And these were run over the subsequent few days.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -rs -bs 32 -se 20 -ep 10
... ...
resuming training at epoch 20, loading saved states
        loading runs\rk1_sv\lstm_20.pt
running epochs in range(20, 30)
... ...
epoch 30: 100%|██████████████████████████████████████████████████████████████████| 13418/13418 [08:26<00:00, 26.51it/s]

Upped the size of the generated output string.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 30
... ...
initial input: ['the', 'prince', 'left']
generated text: the prince left dinner more than a young, ceremonious turkey before them. in petersburg the time the marshal of the province had been brought to him

And the second ten epochs of training.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -rs -bs 32 -se 30 -ep 10
... ...
resuming training at epoch 30, loading saved states
        loading runs\rk1_sv\lstm_30.pt
running epochs in range(30, 40)
... ...
epoch 40: 100%|██████████████████████████████████████████████████████████████████| 13418/13418 [08:28<00:00, 26.40it/s]

Upped the output word count again. Though, as you can see, the model isn’t producing anything like an English sentence or two.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 40
... ...
initial input: ['the', 'prince', 'left']
generated text: the prince left up. “why, you would feel not, but a perhaps?” and dolly disliked the better; that she was not married, mortified her her ruin. the glittering and rattling

A Final Ten Epochs of Resumed Training

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -rs -bs 32 -se 40 -ep 10
... ...
resuming training at epoch 40, loading saved states
        loading runs\rk1_sv\lstm_40.pt
running epochs in range(40, 50)
... ...
epoch 50: 100%|██████████████████████████████████████████████████████████████████| 13418/13418 [08:28<00:00, 26.41it/s]

And here’s the plot of the logged errors following that final training session. I won’t show you the progression after each training session, but there was an improvement after each resumed session, though a fairly minor one between the final two. Note that there are ~26.8 logged error values for each epoch of training.

plot of training losses after 50 epochs of (resumed) training
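The plotting itself isn’t anything special. A sketch of the idea, assuming the merged error values end up in a plain Python list (plot_losses and its arguments are illustrative, not my actual plotting code):

# illustrative loss plot; losses is assumed to be the merged list of logged values
import matplotlib.pyplot as plt

def plot_losses(losses, n_eps):
  # put the x-axis in units of epochs (~26.8 logged values per epoch here)
  xs = [i * n_eps / len(losses) for i in range(len(losses))]
  plt.plot(xs, losses)
  plt.xlabel("epoch")
  plt.ylabel("training loss")
  plt.title(f"training losses after {n_eps} epochs of (resumed) training")
  plt.show()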

I have once again upped the output length to get a better idea of what the model generates.

... ...
  p_txt = "The prince left".lower().split(' ')
  max_wds = len(p_txt) + 60
... ...
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 50
 ... ...
initial input: ['the', 'prince', 'left']
generated text: the prince left his bird. again he stopped with veslovsky, smiled. but your nurse’s always very simple, and he’s towards the point. i consider that he applies that one no, came very, and,” said kitty, just as she had just for a second they had from his family,

Not as good an outcome as I was hoping for. I may do some more training, but I’m not currently certain I will do so. After all, this is only an intro of sorts; the algorithm we’re really interested in is the Transformer. So I will likely leave things as they stand.

Done

But we’re not quite finished with the RNN just yet. I want to take a look at how temperature and top-K sampling affect the creativity of the model’s generated text. So the next post will likely be a short one. Then we will tackle Transformers.

Until next time, hope your RNN is generating better text output than mine.