Wasn’t initially sure what this post would be about. I was waffling between a couple of options.

But I am thinking I will write the code needed to use the model, in evaluation mode, to generate text. I am going to have to do so sooner or later. And I am curious as to what the model will produce after just 5 epochs of training. I also believe I will need to do some refactoring of my current code to get the resumption of training working correctly. So, for now, something, hopefully, a little more entertaining.

Generating Text

Now, we don’t have a ChatGPT here. We will be doing this the hard way. To start, we will feed the model an initial prompt. Obtain the output. Convert that into the next word (a bit of fooling around there). Append that to the current prompt. And, as long as the prompt/generated text is shorter than some specified length, repeat until that length is reached.

We will need two mappings: one to convert words to integer indices and one to go the other way. The model needs to be fed integers; we want to see a string of words. Well, if you remember how we prepared our data for processing, we created a Corpus class which had an instantiated Vocabulary class as one of its members. That latter class has two members: the dictionary word2idx and the list idx2word. Those two will give us exactly what we need. And the Corpus class is always instantiated near the start of the module. (That was likely another bad design decision, but it works out for me in this case.)
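For reference, here’s a minimal sketch of the relevant pieces of that class, reconstructed from memory of the earlier data-prep post, so the details may differ a bit from the real thing.

# rough sketch of the Vocabulary class from the earlier data-prep post;
# the actual class has a few more members
class Vocabulary:
  def __init__(self):
    self.word2idx = {}   # token string -> integer index
    self.idx2word = []   # integer index -> token string

  def add_word(self, word):
    # assign the next free index to a previously unseen token
    if word not in self.word2idx:
      self.idx2word.append(word)
      self.word2idx[word] = len(self.idx2word) - 1
    return self.word2idx[word]

  def __len__(self):
    # this is why len() on an instance reports the token count
    return len(self.idx2word)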

Set up Initial Model Evaluation State

So let’s start by instantiating the model and loading the current state from the latest checkpoint. I will be putting this in a new if block with its own boolean variable, eval_model, which for the present will be set to True.
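I won’t reproduce utl.ld_chkpt here, but the core of such a loader is typically just torch.load() plus a couple of load_state_dict() calls. A minimal sketch; the checkpoint key names are assumptions, and my actual helper may differ in detail.

# sketch of what a loader like ld_chkpt typically does; key names assumed
def ld_chkpt(cp_pth, model, optr, rtn_chk=False):
  print(f"\tloading {cp_pth}")
  chkpt = torch.load(cp_pth)            # possibly with map_location=cfg.device
  model.load_state_dict(chkpt["model_state"])
  optr.load_state_dict(chkpt["optim_state"])
  return chkpt if rtn_chk else None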

Just as a check, I will temporarily include code to inspect the corpus and corpus.vocab instantiations. That required a new import. Well, required for the way I did things. There were other choices, e.g. __dict__ and vars(). But those print out all the contents of the class members, which is not something I really needed. Though my approach did require a bit more code.

import inspect
... ...
eval_model = True
... ...
if eval_model:
  cp_pth = cfg.sv_dir/f"lstm_{cfg.start_ep}.pt"
  chkpt = utl.ld_chkpt(cp_pth, lstm, optr, rtn_chk=False)
  if True:
    # let's check the corpus and vocab variables
    for obj in [corpus, corpus.vocab]:
      print(f"{obj.__class__.__name__}")
      for i in inspect.getmembers(obj):
        # skip private and protected attributes
        if not i[0].startswith('_'):
          # and skip any remaining bound methods
          if not inspect.ismethod(i[1]):
            print(f"\t{i[0]} {len(i[1]) if obj == corpus.vocab or i[0] == 'vocab' else ''}")

I couldn’t use len() for every class member; some of them do not support len().

And here’s what showed up in the terminal window.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 5
 {'run_nm': 'rk1', 'batch_sz': 32, 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 5, 'epochs': 5, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
took 3.780370700013009 to generate pairs
        loading runs\rk1_sv\lstm_5.pt
Corpus
        d_dir
        fl_nm
        trn
        txt_p
        vocab 14224
Vocabulary
        idx2word 14224
        word2idx 14224

And we now also know we have 14,224 tokens in the vocabulary. The Vocabulary class has a __len__ method defined (as in the sketch above) which returns the length of the idx2word member. That is why len(corpus.vocab) returned that number rather than 2. The number of tokens will be relevant later on. And not having to hard-code the value is a nice touch.

Test Prompt

Let’s move on. I want to test submitting a prompt and see what the output looks like, referring back to the model definition for clarification as needed. So, again, some temporary code to inspect those outputs. I don’t know that this is necessary, but I like taking small steps.

Some of this code may eventually go into a function, but for now I will build it directly in the if block.

if eval_model:
  cp_pth = cfg.sv_dir/f"lstm_{cfg.start_ep}.pt"
  chkpt = utl.ld_chkpt(cp_pth, lstm, optr, rtn_chk=False)
  lstm.eval()
  max_wds = 50
  p_txt = "The prince".lower().split(' ')
  # batch of 1
  hh, hc = lstm.init_hdn(1)
  inp = torch.tensor([[corpus.vocab.word2idx[w] for w in p_txt]])
  inps = inp.to(cfg.device)
  outp, (hh, hc) = lstm(inps, (hh, hc))
  print(f"outp: {type(outp)} {outp.shape}, hh: {type(hh)} {hh.shape}, hc: {type(hc)} {hc.shape}")
  print("outp", outp[0,0,0], outp[0,1,0])
  print(f"oupt[0][-1]: {outp[0][-1].shape}")
  print(f"hh[-1][0][-1]: {hh[-1][0][-1].shape}")

In the terminal window I got the following.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 5
 {'run_nm': 'rk1', 'batch_sz': 32, 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 5, 'epochs': 5, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
took 3.82687829999486 to generate pairs
        loading runs\rk1_sv\lstm_5.pt
outp: <class 'torch.Tensor'> torch.Size([1, 2, 14224]), hh: <class 'torch.Tensor'> torch.Size([3, 1, 128]), hc: <class 'torch.Tensor'> torch.Size([3, 1, 128])
outp tensor(0.5361, device='cuda:1', grad_fn=<SelectBackward0>) tensor(2.3947, device='cuda:1', grad_fn=<SelectBackward0>)
outp[0][-1]: torch.Size([14224])
hh[-1][0][-1]: torch.Size([])

What Are We Getting?

As I understand it, the output consists of the model’s score for each possible next token, at each input position. We had a batch size of \(1\), so the first dimension is just that. And because there were two tokens in the input, there are two rows of scores in the output, one per input position. In each row there is one score for each token in the vocabulary. As I understand it, these scores are essentially logits.

Do note that we have trained on a sequence size of 100 tokens. So we shouldn’t pass more than 100 tokens to the model when generating text.
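If the running token list ever got longer than that, trimming it before each forward pass would be enough. Something along these lines; MAX_SEQ is a hypothetical name, and this isn’t needed for the short runs below.

  MAX_SEQ = 100  # the sequence length used during training
  # keep only the most recent MAX_SEQ tokens when building the model input
  inp = torch.tensor([[corpus.vocab.word2idx[w] for w in p_txt[-MAX_SEQ:]]])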

“logits” refer to the raw, unnormalized predictions generated by the last layer of a neural network before applying an activation function

What is the meaning of the word logits in TensorFlow?

Unfortunately, these values can range from \(-\infty\) to \(+\infty\).

The values are interpreted as unnormalized log-probabilities. So we are going to use the softmax function to convert our logits into a probability distribution, then use that distribution to select the next token with NumPy’s random.Generator.choice(). The selected token gets appended to our list of tokens to produce the next input for the model, repeating until we generate a series of tokens of the desired length.
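For a vector of logits \(z\), softmax gives \(p_i = e^{z_i} / \sum_j e^{z_j}\), which is non-negative and sums to \(1\). A quick standalone demonstration of the conversion and the sampling step, using toy values for a 3-token vocabulary:

import numpy as np

rng = np.random.default_rng()
logits = np.array([2.0, 1.0, -1.0])        # fake scores for a 3-token vocabulary
p = np.exp(logits) / np.exp(logits).sum()  # softmax: p_i = e^{z_i} / sum_j e^{z_j}
print(p)                                   # [0.705 0.259 0.035], sums to 1
nxt = rng.choice(len(logits), p=p)         # index drawn per that distribution
print(nxt)                                 # usually 0, sometimes 1, rarely 2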

Getting the Next Token

Let’s see if we can get that coded for a single model output. (A bit of code repetition.)

if eval_model:
  cp_pth = cfg.sv_dir/f"lstm_{cfg.start_ep}.pt"
  chkpt = utl.ld_chkpt(cp_pth, lstm, optr, rtn_chk=False)
  lstm.eval()
  max_wds = 50
  p_txt = "The prince left".lower().split(' ')
  print(f"initial input: {p_txt}")
  # batch of 1
  hh, hc = lstm.init_hdn(1)
  inp = torch.tensor([[corpus.vocab.word2idx[w] for w in p_txt]])
  inps = inp.to(cfg.device)
  outp, (hh, hc) = lstm(inps, (hh, hc))
  logits = outp[0][-1]
  p = nn.functional.softmax(logits, dim=0).detach().cpu().numpy()
  nxt_tk_idx = cfg.rng.choice(len(logits), p=p)
  p_txt.append(corpus.vocab.idx2word[nxt_tk_idx])
  print(f"current list of tokens: {p_txt}")

And because I am not setting a random seed for PyTorch or NumPy, when I run the module twice I get a different token.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 5
... ...
initial input: ['the', 'prince', 'left']
current list of tokens: ['the', 'prince', 'left', 'cold']
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 5
... ...
initial input: ['the', 'prince', 'left']
current list of tokens: ['the', 'prince', 'left', 'in']
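Not a problem here. But if repeatable runs were wanted, seeding both libraries up front would do it. A sketch; cfg.rng is actually created elsewhere in my code.

import numpy as np
import torch

SEED = 42                          # any fixed value
torch.manual_seed(SEED)            # seeds PyTorch's RNGs (all devices)
rng = np.random.default_rng(SEED)  # a seeded Generator to use in place of cfg.rng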

Let’s Generate a Total of 10 Tokens

That’s the original 3 plus 7 more.

if eval_model:
  cp_pth = cfg.sv_dir/f"lstm_{cfg.start_ep}.pt"
  chkpt = utl.ld_chkpt(cp_pth, lstm, optr, rtn_chk=False)
  lstm.eval()
  max_wds = 10
  p_txt = "The prince left".lower().split(' ')
  print(f"initial input: {p_txt}")
  # batch of 1
  hh, hc = lstm.init_hdn(1)
  while len(p_txt) < max_wds:
    inp = torch.tensor([[corpus.vocab.word2idx[w] for w in p_txt]])
    inps = inp.to(cfg.device)
    outp, (hh, hc) = lstm(inps, (hh, hc))
    logits = outp[0][-1]
    p = nn.functional.softmax(logits, dim=0).detach().cpu().numpy()
    nxt_tk_idx = cfg.rng.choice(len(logits), p=p)
    p_txt.append(corpus.vocab.idx2word[nxt_tk_idx])
  fn_txt = " ".join(p_txt)
  print(f"generated text: {fn_txt}")

And a trial run, using the original whole-list version.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 5
 {'run_nm': 'rk1', 'batch_sz': 32, 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 5, 'epochs': 5, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
took 3.8207746000261977 to generate pairs
        loading runs\rk1_sv\lstm_5.pt
initial input: ['the', 'prince', 'left']
generated text: the prince left medicine to a standstill , taking vronsky

Hardly anything coherent. Additional executions weren’t any better. It also looks like we will need to do some tidying of the final text string: note the space in front of the comma. That said, time for more training.
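The tidying should be simple enough. A regex pass along these lines, sketched here for just the obvious punctuation marks, would close up those gaps.

import re

def tidy(txt):
  # remove the space that " ".join() leaves in front of punctuation
  return re.sub(r"\s+([.,;:!?])", r"\1", txt)

print(tidy("the prince left medicine to a standstill , taking vronsky"))
# -> the prince left medicine to a standstill, taking vronsky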

Done for Now

So, back to working on getting resumption of training to work as desired.

But, I did enjoy this little escape. May you also find the opportunity to escape from time to time.

Resources