Last post I pretty much ended things in the midst of developing the code to clean up the text from the novel. This is needed to facilitate building the vocabulary and producing properly tokenized data to pass to the model during training. I am going to use two classes to do all of this work: cleaning the text, managing the vocabulary, and converting text tokens to numeric tokens and vice versa.

Cleaning and Tokenizing the Data (continued)

The Vocabulary class is pretty straightforward. So, I am just going to show you the code. As mentioned last post, the classes will be defined in the d_tokenize.py module.

Vocabulary

# d_tokenize.py: module for classes used to tokenize the text data

from collections import Counter
from pathlib import Path
import time
import torch

class Vocabulary():
  def __init__(self):
    super().__init__()
    self.word2idx = {}
    self.idx2word = []


  def add_word(self, wd, ndx):
    if wd not in self.word2idx:
      self.idx2word.append(wd)
      self.word2idx[wd] = ndx
    return self.word2idx[wd]

 
  def __len__(self):
    return len(self.idx2word)
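
Just to illustrate how this gets used (a quick sketch, not part of the module): add a few words and look them up in both directions.

vocab = Vocabulary()
for i, wd in enumerate(["the", "cat", "sat"]):
  vocab.add_word(wd, i)

print(len(vocab))              # -> 3
print(vocab.word2idx["cat"])   # -> 1
print(vocab.idx2word[1])       # -> cat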

Corpus

The Corpus class will be considerably more complicated. It will be used to clean the text, produce the vocabulary and generate a numeric tokenized version of the novel’s text. Down the road, perhaps more. I will start with the cleaning bit.

And, as usual, I will save intermediate steps as appropriate and load them from file rather than recreate them each time I execute the module’s code during development. Not that most of the individual steps take terribly long.

I will refer you to the last post for the discussion of what the cleaning entails; I don’t want to repeat all of it here.
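
As a quick reminder of the effect, here’s a small sketch, using a made-up sentence rather than text from the novel, of the lowercasing and punctuation splitting covered last post.

sample = 'He said, "Anna\'s train arrives to-night!"'
cln = sample.lower().replace("\n", " ").replace("-", " ")
for x in ",.:;?!$()/_&%*@'’”":
  cln = cln.replace(f"{x}", f" {x} ")
cln = cln.replace('"', ' " ')
print(cln.split())
# ['he', 'said', ',', '"', 'anna', "'", 's', 'train', 'arrives', 'to', 'night', '!', '"']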

... ...
class Corpus():
  def __init__(self, d_dir, fl_nm):
    self.vocab = Vocabulary()
    self.d_dir = d_dir
    self.fl_nm = fl_nm
    self.txt_p = d_dir/fl_nm
    # the following will change; I am just using it to run the cleaning code during development
    self.trn = self.clean()

  
  def clean(self):
    cln_pth = self.d_dir/"cln_token.pt"
    if cln_pth.exists():
      words = torch.load(cln_pth)
    else:
      with open(self.txt_p, "r", encoding="utf-8-sig") as f:
        text = f.read()
      cln_txt = text.lower().replace("\n", " ")
      cln_txt = cln_txt.replace("-", " ")
      for x in ",.:;?!$()/_&%*@'’”":
          cln_txt = cln_txt.replace(f"{x}", f" {x} ")
      cln_txt = cln_txt.replace('"', ' " ')
      words = cln_txt.split()
      print(f"saving tokenization after cleaning text to {cln_pth}")
      torch.save(words, cln_pth)

    return words

... ...
if __name__ == "__main__":
  fl_nm = "pg1399.txt"
  d_dir = Path("./data")

  corpus = Corpus(d_dir, fl_nm)
  n_tokens = len(corpus.trn)
  print(f"total number of tokens: {n_tokens}")

And, in the terminal I got the following.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python d_tokenize.py
data data\pg1399.txt
saving tokenization after cleaning text to data\cln_token.pt
total number of tokens: 429463

And when I executed the module a second time, the cleaned text was loaded from file.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python d_tokenize.py
data data\pg1399.txt
total number of tokens: 429463

Okay, let’s get the vocabulary built. The files to which I plan to save the built vocabulary and the tokenized training data do not yet exist, but the code will create them on the first run and load from them thereafter.

... ...
    self.trn = self.tokenize()
... ...
  def tokenize(self):
    tk_pth = self.d_dir/"tokens.pt"
    vcb_pth = self.d_dir/"vocab.pt"

    if not (vcb_pth.exists() and tk_pth.exists()):
      cln_txt = self.clean()

    if vcb_pth.exists():
      t_vcb = torch.load(vcb_pth)
      self.vocab = t_vcb
    else:
      # build vocabulary
      # get count for each distinct token
      wd_cnts = Counter(cln_txt)
      # get vocabulary with tokens sorted on count in reverse order, just cuz
      tmp_vcb = sorted(wd_cnts.items(), key=lambda pair: pair[1], reverse=True)
      print(f"number unique tokens: {len(tmp_vcb)}\n{tmp_vcb[:10]}")
      # add tokens to vocabulary object
      for i, (wd, _) in enumerate(tmp_vcb):
        self.vocab.add_word(wd, i)
   
      torch.save(self.vocab, vcb_pth)

    # we will tackle this next, but needed placeholder for function return value
    if tk_pth.exists():
      t_tkns = torch.load(tk_pth)
    else:
      t_tkns = []

    return t_tkns
... ...
if __name__ == "__main__":
  corpus = Corpus(d_dir, fl_nm)
  n_tokens = len(corpus.trn)
  print(f"total number of tokens: {n_tokens}")
  print(f"number unique tokens: {len(corpus.vocab)}\n{corpus.vocab.idx2word[:10]}")
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python d_tokenize.py
data data\pg1399.txt
number unique tokens: 14224
[(',', 30997), ('.', 19953), ('the', 17392), ('and', 12626), ('to', 10085), ('of', 8575), ('he', 7683), ('”', 6982), ('’', 6631), ('a', 6076)]
total number of tokens: 0
number unique tokens: 14224
[',', '.', 'the', 'and', 'to', 'of', 'he', '”', '’', 'a']

As one might expect, ‘,’, ‘.’, ‘the’, ‘and’, ‘to’, and ‘of’ are the most common tokens in the file. And when I check the data directory, vocab.pt is indeed present. Okay, let’s build the numeric training data. A simple loop will do.

    if tk_pth.exists():
      t_tkns = torch.load(tk_pth)
    else:
      t_tkns = []
      for wd in cln_txt:
        t_tkns.append(self.vocab.word2idx[wd])
      t_tkns = torch.tensor(t_tkns).type(torch.int64)
      torch.save(t_tkns, tk_pth)

    return t_tkns
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python d_tokenize.py
data data\pg1399.txt
total number of tokens: 429463
number unique tokens: 14224
[',', '.', 'the', 'and', 'to', 'of', 'he', '”', '’', 'a']
number of tokens in training set: 429463

And the file tokens.pt is present in the data directory. Also, on subsequent runs the data from those files is used to fill the appropriate properties of the Corpus class when it is instantiated.

Data Batches

The examples had me a mite confused about what they were doing and why. They were generating a list of tokens for input. These varied in length between examples but were of a reasonable size; I will likely use something on the order of 100 tokens. Then they were generating another list consisting of the input list from its second element onward, plus the token that follows the input in the source data. In other words, the input shifted one token to the right.

Took me some time, but the night after I completed the discussion on cleaning and tokenizing the source text I had, during a waking moment, a bit of an epiphany. I came to realize that this model is essentially performing multi-category classification: at each position it is trying to predict the next token from a vocabulary with thousands of choices. We will be using CrossEntropyLoss() to generate the loss values during training. It requires the current model output and a target to compare it against, so we need to generate a target for each input. Since we want the model to predict the next word in the sequence, it makes sense that the target should contain that word, and it will need to be the same size as the output for which the loss is being calculated.
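
To make that concrete, here is a minimal sketch, with random stand-in tensors rather than the project’s model, of how the output and target shapes line up for CrossEntropyLoss().

import torch
import torch.nn as nn

vocab_size, seq_len, batch_sz = 14224, 100, 32

# stand-in for the model output: a score for every vocabulary word at every position
logits = torch.randn(batch_sz, seq_len, vocab_size)
# stand-in for the target: the input sequence shifted one token to the right
targets = torch.randint(0, vocab_size, (batch_sz, seq_len))

# CrossEntropyLoss expects (N, C) scores and (N,) class indices,
# so the batch and sequence dimensions get flattened together
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())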

Input Data Pairs

So let’s get on with building the inputs and targets. And then instantiate a dataloader. I am likely going to make the building of the input data a method within the Corpus class. Seems reasonable at the moment.

Here’s the code for the new method.

  """
    Generate input pairs to use as inputs for training
    Each pair will consist of model input and the target for that input
    i_len = number of tokens for each input
  """
  def get_trn_data(self, i_len=100):
    i_t_pth = self.d_dir/"in_pairs.pt"
    if i_t_pth.exists():
      print(f"loading data pairs from {i_t_pth}")
      inp_tgt = torch.load(i_t_pth)
    else:
      inp_tgt = []
      # print(f"generating data pairs from tokenized data")
      s_tm = time.perf_counter()
      for n in range(0, len(self.trn)-i_len-1):
        x = self.trn[n:n+i_len]
        y = self.trn[n+1:n+i_len+1]
        inp_tgt.append((x, y))
      e_tm = time.perf_counter()
      print(f"took {e_tm - s_tm} to generate pairs")
      torch.save(inp_tgt, i_t_pth)
    
    return inp_tgt

And a bit of a test. I added the timing to the above method after a test or two; I had gotten suspicious that writing/reading the file cost more time than generating the pairs.

if __name__ == "__main__":
  corpus = Corpus(d_dir, fl_nm)
  n_tokens = len(corpus.trn)
  print(f"total number of tokens: {n_tokens}")
  print(f"number unique tokens: {len(corpus.vocab)}\n{corpus.vocab.idx2word[:10]}")
  print(f"number of tokens in training set: {len(corpus.trn)}\n{corpus.trn[:10]}")
  st_tm = time.perf_counter()
  in_data = corpus.get_trn_data(i_len=100)
  nd_tm = time.perf_counter()
  print(f"took {nd_tm - st_tm} to generate data pairs and write to file")
  print(f"input data shape: {len(in_data)}, {len(in_data[0])}")
  print(f"tail of input and target:\n{in_data[0][0][-10:]}\n{in_data[0][1][-10:]}")

And the terminal output was:

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python d_tokenize.py
data data\pg1399.txt
total number of tokens: 429463
number unique tokens: 14224
[',', '.', 'the', 'and', 'to', 'of', 'he', '”', '’', 'a']
number of tokens in training set: 429463
tensor([ 321,   45,  208, 2820,  291, 3048,   83,   31, 2627,   34])
generating data pairs from tokenized data
took 3.4073448999843094 to generate pairs
took 30.219695200008573 to generate data pairs and write to file
input data shape: 429362, 2
tail of input and target:
tensor([   2,  157,    3,  127,  562,    0,   25,   31,    2, 1616])
tensor([ 157,    3,  127,  562,    0,   25,   31,    2, 1616,    5])

So, it takes roughly 3½ seconds to generate the pairs and about 26 seconds to write them to file. A second test showed it took almost 34 seconds to load the data back from file. So, in this case, I will not be using a file to store/load the input data pairs. The code for the method now looks as follows.

  def get_trn_data(self, i_len=100):
    inp_tgt = []
    for n in range(0, len(self.trn)-i_len-1):
      x = self.trn[n:n+i_len]
      y = self.trn[n+1:n+i_len+1]
      inp_tgt.append((x, y))
    
    return inp_tgt
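
As an aside, if the up-front list of 429,362 pairs ever became a memory or start-up concern, the same pairs could be produced lazily with a small Dataset; the following is just a sketch of that idea, not part of the project code.

from torch.utils.data import Dataset

class NextTokenDataset(Dataset):
  """Serve (input, target) pairs as slices of the token tensor on demand."""
  def __init__(self, tokens, i_len=100):
    self.tokens = tokens
    self.i_len = i_len

  def __len__(self):
    return len(self.tokens) - self.i_len - 1

  def __getitem__(self, n):
    x = self.tokens[n:n + self.i_len]
    y = self.tokens[n + 1:n + self.i_len + 1]
    return x, y

# e.g. tk_ldr = torch.utils.data.DataLoader(NextTokenDataset(corpus.trn), batch_size=32, shuffle=True)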

Data Loader

Let’s add code to the main module, nlp.py, to instantiate the Corpus class, generate the pairs, and set up the dataloader.

Also, because the default batch size in config.py is not really valid for most of the project’s cases, I refactored the code in that module to make the batch size parameter a required one. I won’t bother showing those code changes.

... ...
from d_tokenize import Corpus, Vocabulary
... ...
tst_dl = True
... ...
d_dir = Path("./data")
fl_nm = "pg1399.txt"
i_len = 100

torch.manual_seed(42)

corpus = Corpus(d_dir, fl_nm)
in_tgt = corpus.get_trn_data(i_len=i_len)

if not tst_dl:
  tk_ldr = torch.utils.data.DataLoader(in_tgt, batch_size=cfg.batch_sz, shuffle=True)
else:
  tk_ldr = torch.utils.data.DataLoader(in_tgt, batch_size=cfg.batch_sz, shuffle=False)
  # bit of a double check
  print(f"input data shape: {len(in_tgt)}, {len(in_tgt[0])}")
  print(f"tail of input and target:\n{in_tgt[0][0][-10:]}\n{in_tgt[0][1][-10:]}")
  i_inp, i_tgt = next(iter(tk_ldr))
  print(i_inp.shape, i_tgt.shape)
  print(f"tail of input and target from dataloader:\n{i_inp[0][-10:]}\n{i_tgt[0][-10:]}")

When I first coded the above I did not import Vocabulary from d_tokenize. And that produced the following error.

AttributeError: Can't get attribute 'Vocabulary' on <module '__main__' from 'F:\\learn\\mcl_pytorch\\proj8\\nlp.py'>

So, I added the import of Vocabulary. (torch.load() unpickles the saved Vocabulary object, and pickle needs to be able to find that class definition in the loading script’s namespace.) Following that, the terminal output was as follows.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32
 {'run_nm': 'rk1', 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 4, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
input data shape: 429362, 2
tail of input and target:
tensor([   2,  157,    3,  127,  562,    0,   25,   31,    2, 1616])
tensor([ 157,    3,  127,  562,    0,   25,   31,    2, 1616,    5])
torch.Size([32, 100]) torch.Size([32, 100])
tail of input and target from dataloader:
tensor([   2,  157,    3,  127,  562,    0,   25,   31,    2, 1616])
tensor([ 157,    3,  127,  562,    0,   25,   31,    2, 1616,    5])

If I ran the test with shuffle=True the output looked like this:

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32
 {'run_nm': 'rk1', 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 4, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
input data shape: 429362, 2
tail of input and target:
tensor([   2,  157,    3,  127,  562,    0,   25,   31,    2, 1616])
tensor([ 157,    3,  127,  562,    0,   25,   31,    2, 1616,    5])
torch.Size([32, 100]) torch.Size([32, 100])
tail of input and target from dataloader:
tensor([12529,     0,    63,    17,   133,    13,   415,    49,  7365,    10])
tensor([   0,   63,   17,  133,   13,  415,   49, 7365,   10, 1534])

Note: with the testing done, tst_dl is now set back to False.

Done Once Again

Things are moving along, albeit rather slowly. But we are finally ready to tackle defining and instantiating the RNN model, followed by a function to execute an epoch of training. That, I think, will be too much content for one of my posts, so I am calling this one finished.

May your coding time go as smoothly as that covered in this post did for me.

Resources