I still haven’t sorted out my problem getting the CycleGAN to successfully train. But I figured I’d move on—taking a relatively big step. I am going to try some natural language processing (NLP). And I am going to start with a recurrent neural network (RNN). These are in fact not often used these days—transformers are the currently preferred choice. But the tutorials I looked at suggested that coding and training an RNN would provide a good introduction to NLP and its trials and tribulations.

Basic RNN Concept

RNNs were designed to handle sequential data, be that time series or text.

The basic idea of recurrent neural networks is that they emulate a primitive form of memory. This is critical for NLP because we don’t read or speak words in isolation. Each new word is more or less related to all the words that came before it, be that in building an idea, grammar or logic. RNNs can, to some extent, learn that contextual information. The RNN does this by introducing a hidden state which encodes information about past inputs. At each step, the RNN takes in an input and the previous hidden state and produces an output and a new hidden state. The hidden state carried forward from previous steps thus influences the processing of the current input.
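To make that a bit more concrete, here is a minimal sketch of that recurrence using PyTorch’s nn.RNNCell. The sizes and tensors are placeholder values of my own, not anything from the tutorials.

import torch
import torch.nn as nn

# toy sizes, for illustration only
in_sz, hid_sz, seq_len = 8, 16, 5

rnn_cell = nn.RNNCell(in_sz, hid_sz)
inputs = torch.randn(seq_len, 1, in_sz)   # a sequence of 5 inputs, batch size 1
h = torch.zeros(1, hid_sz)                # initial hidden state

for x_t in inputs:
  # each step sees the current input plus the hidden state from the previous step
  h = rnn_cell(x_t, h)
print(h.shape)   # torch.Size([1, 16])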

The particular type of RNN I am going to use is an LSTM (Long Short-Term Memory). No, I didn’t sort out all the concepts or math involved. That said, LSTMs were proposed in order to deal with the vanishing and exploding gradient problems experienced when training a basic RNN.

An LSTM cell has three main gates:

Forget gate: Controls what information to discard from the previous cell state
Input gate: Controls what new information to add to the cell state
Output gate: Controls what information from the cell state to output

Understanding Recurrent Neural Networks Step-by-Step with PyTorch, Jordan Brown
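For what it’s worth, in PyTorch all of that gate machinery is wrapped up inside nn.LSTM. A minimal sketch, with sizes that are just placeholders of mine:

import torch
import torch.nn as nn

# placeholder sizes, for illustration only
emb_sz, hid_sz, n_layers = 32, 64, 1

lstm = nn.LSTM(emb_sz, hid_sz, n_layers, batch_first=True)
x = torch.randn(1, 10, emb_sz)   # batch of 1, sequence of 10 embedded tokens

# an LSTM carries two states: the hidden state h and the cell state c,
# which the forget/input/output gates update at every step
out, (h_n, c_n) = lstm(x)
print(out.shape, h_n.shape, c_n.shape)
# torch.Size([1, 10, 64]) torch.Size([1, 1, 64]) torch.Size([1, 1, 64])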

Lots of resources below provide much better and likely more accurate explanations of RNNs. Let’s get to the job at hand.

New Project Directory and Initial Project File

Okay, new directory, ..\mcl_pytorch\proj8. And a data subdirectory. Then a new Python module, nlp.py, based on my default startup code for these projects. I didn’t call it rnn.py because I expect that will be the name of the module containing the model code. I copied the basic startup code (imports, command line processing, boolean control variables, a bit of debugging setup, etc.) from a previous project.

Won’t bother showing it here. Expect you likely have your own version.

Data Preparation

Tutorials I looked at trained the model on the WikiText-2 dataset, the Reddit clean jokes dataset, novels, etc. One of the most commonly used novels was Anna Karenina, by Leo Tolstoy. No idea why. But the first two datasets didn’t much interest me and I’ve never read Anna Karenina, so that’s what I will use to train the RNN model.

Most of the tutorials I looked at used character-level training. One or two used word-level training. Word-level tokenization brings more semantic meaning to training the model than character-level tokenization, though it brings its own issues. With character tokenization, the size of the “vocabulary”, i.e. the unique tokens the model must work with, is generally pretty small: for English text, the 26 letters of the alphabet plus 6 to 12 extra characters (punctuation, quotes, etc.). With word tokenization the number of unique tokens gets rather large, requiring the model to track many more parameters. Though there are techniques to help reduce this load, one of which we will use.

Data preparation is one of the key steps in training an NLP model. We need to extract all the possible tokens so that we can establish the “vocabulary”. That involves manipulating the source data to generate a list of individual tokens, then extracting the unique tokens and building the vocabulary. We will need the vocabulary to convert tokens to numbers (for training) and vice versa (for prediction output).
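As a rough sketch of what that vocabulary mapping amounts to (the names here are placeholders of my own, not the eventual class code):

# `tokens` stands in for the cleaned-up list of word tokens described below
tokens = ["happy", "families", "are", "all", "alike", "every", "unhappy", "family"]

uniq = sorted(set(tokens))
tok2idx = {tok: i for i, tok in enumerate(uniq)}   # token -> number, for training
idx2tok = {i: tok for tok, i in tok2idx.items()}   # number -> token, for prediction output

encoded = [tok2idx[t] for t in tokens]
decoded = [idx2tok[i] for i in encoded]
print(encoded)
print(decoded == tokens)   # True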

We will also need code to divide the long sequence generated from tokenizing the novel into shorter sequences of equal length. These sequences will be the input features of the model. For each input sequence, we generate the corresponding target sequence by shifting the input one token to the right. This allows the LSTM model to learn to predict the next token in a sequence of tokens. These pairs of sequences are the training data.
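Something along these lines, though the sequence length and names are placeholders for now:

# split an encoded token list into fixed-length input/target pairs;
# `encoded` stands in for the whole novel as a list of token indices
seq_len = 5
encoded = list(range(23))

pairs = []
for i in range(len(encoded) - seq_len):
  x = encoded[i : i + seq_len]          # input sequence
  y = encoded[i + 1 : i + seq_len + 1]  # same sequence shifted right by one token
  pairs.append((x, y))

print(pairs[0])   # ([0, 1, 2, 3, 4], [1, 2, 3, 4, 5])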

All of the code related to preparing and serving the tokenized data will go into a separate module, tokenize.py. Not sure that’s the greatest name but… I expect there will eventually be classes made available from that module for working with the vocabulary and the parsed text of the novel.

Get Dataset

I used the Project Gutenberg site to get a textual copy of the novel. A search brought me to this page. I obviously didn’t want an e-book or any non-textual version (which also excluded HTML versions). Fortunately there was the link: Plain Text UTF-8. That presented a page with the complete text of the novel, along with some additional information. I saved the page, ~2.0 MB, to a file in my project’s data directory.

Before proceeding with processing the file, I deleted everything at the start of the file before the book’s header for Part 1 (~ line 53). I also deleted everything from the line “*** END OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA ***” to the end of the file. Well, I also included a few blank lines above that—from line number 39873 in my copy of the file. Probably could have deleted more at the start of the file, but…

Now onto the real work.

Note: I had thought about using a library, e.g. NLTK, to do the tokenizing for me. But figured it best to, at least once, do it “by hand”. That way I will better understand what is involved.

Tokenizing and Cleaning Up the Text

Initial Tokenizing

Let’s start by loading the file and splitting the text on spaces, as a first attempt at word tokenization. Then have a look at what we have. As I will be developing in steps, I will be saving intermediate stages of the data preparation to files so as to avoid repeating the earlier processing steps with each iteration of development. That will affect the structure of the code as I slowly develop the full data parsing process.

from pathlib import Path

fl_nm = "pg1399.txt"
d_dir = Path("./data")

if __name__ == "__main__":
  do_init_toke = False

  # load the full novel, then do a naive first-pass tokenization by splitting on spaces
  with open(d_dir/fl_nm, "r") as f:
    text = f.read()
  words = text.split(" ")
  print(words[:20])
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1985: character maps to <undefined>

And a bit of an issue. So I added an encoding="utf-8" parameter to the call to open.

Then I had that \ufeff at the start of the file. Changing the above to encoding="utf-8-sig" fixed that issue.
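For reference, the corrected load now looks something like this:

  # "utf-8-sig" also strips the byte order mark (\ufeff) that plain "utf-8" leaves in place
  with open(d_dir/fl_nm, "r", encoding="utf-8-sig") as f:
    text = f.read()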

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python tokenize.py
data data\pg1399.txt
['\ufeffPART', 'ONE\n\nChapter', '1\n\n\nHappy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its\nown', 'way.\n\nEverything', 'was', 'in', 'confusion', 'in', 'the']
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python tokenize.py
data data\pg1399.txt
['PART', 'ONE\n\nChapter', '1\n\n\nHappy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its\nown', 'way.\n\nEverything', 'was', 'in', 'confusion', 'in', 'the']

As I mentioned, I plan to save intermediate steps as I work on building the module and its classes. So I imported torch and added the following to the above code.

... ...
import torch
... ...
fl_nm = "pg1399.txt"
d_dir = Path("./data")

if __name__ == "__main__":
  ... ...
  # save the raw token list so later steps can reload it instead of re-parsing the novel
  t_fl = "init_toke.pt"
  print(f"saving init tokenization to {d_dir/t_fl}")
  torch.save(words, d_dir/t_fl)

And boom! My apologies for the lengthy and incomprehensible error message.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8a> python tokenize.py
Traceback (most recent call last):
  File "F:\learn\mcl_pytorch\proj8a\tokenize.py", line 2, in <module>
    import torch
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
    _lazy_call(_check_capability)
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\site-packages\torch\cuda\__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
                                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\traceback.py", line 218, in format_stack
    return format_list(extract_stack(f, limit=limit))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\traceback.py", line 232, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\traceback.py", line 395, in extract
    return klass._extract_from_extended_frame_gen(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\traceback.py", line 438, in _extract_from_extended_frame_gen
    f.line
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\traceback.py", line 323, in line
    self._line = linecache.getline(self.filename, self.lineno)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\linecache.py", line 30, in getline
    lines = getlines(filename, module_globals)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\linecache.py", line 46, in getlines
    return updatecache(filename, module_globals)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\appDev\Miniconda3\envs\mclp-3.12\Lib\linecache.py", line 136, in updatecache
    with tokenize.open(fullname) as fp:
         ^^^^^^^^^^^^^
AttributeError: module 'tokenize' has no attribute 'open'

I spent a good few hours trying to sort this out. I finally changed the file name and got things working. It turns out Python has its own tokenize module, and my module was shadowing it.

Why didn’t it explode until I imported torch? I expect it was because all of the earlier code and imports were internal to Python, so nothing ever needed the standard tokenize module. Only when I imported the external package, i.e. torch, did Python end up importing the module I had inadvertently shadowed with my own code. Which of course would not, and did not, go well.

I have now renamed the module d_tokenize.py. And…

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8a> python d_tokenize.py
data data\pg1399.txt
['PART', 'ONE\n\nChapter', '1\n\n\nHappy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its\nown', 'way.\n\nEverything', 'was', 'in', 'confusion', 'in', 'the']
saving init tokenization to data\init_toke.pt

Okay, let’s move on. I will start by testing the loading of the initial parse of the novel from the file I saved it to, then move on to cleaning up the tokens. Though I will in fact need to load and process the raw text from scratch to do the clean-up, so saving the initial tokenization was really just a test.
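The reload itself is just the mirror of the save, something like:

  # reload the saved initial tokenization rather than re-reading and re-splitting the novel
  t_fl = "init_toke.pt"
  words = torch.load(d_dir/t_fl)
  print(words[:20])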

Cleaning up the Text

Looking at the text, a few things stand out. Firstly, line breaks are treated as text characters, for example 'ONE\n\nChapter'. We really want that to be two distinct tokens, so we need to remove the line breaks before we tokenize the text. In this same example we see another potential issue. In our vocabulary we would not want ONE to be a different token than one or One. Similarly for Chapter. So we will also convert the text to lowercase before generating our vocabulary. And in the next example, 'way.\n\nEverything', we have something else that will need tidying. The token of interest is way; we don’t want way. to be a different token. So we will need to put spaces around any and all punctuation and the like. And though there is no example in that sample of tokenized text above, I expect that we really don’t need hyphens in the vocabulary. According to a quick check there are 1570 of them in my copy of the novel’s text. Seems like way too many. Here are a few examples: man-cook, leather-covered, well-cared-for. So we will also get rid of them. And the quotes and apostrophes in my copy are the fancy UTF-8 versions. Something else we will need to account for.
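To give a feel for the kind of clean-up I have in mind, a rough first cut might look like the following. The exact set of punctuation handled, and whether hyphens become spaces or disappear entirely, is still to be settled.

import re

def clean_text(text):
  # lowercase so "ONE", "One" and "one" all map to the same token
  text = text.lower()
  # replace the fancy UTF-8 quotes and apostrophes with plain equivalents
  text = text.replace("\u201c", '"').replace("\u201d", '"')
  text = text.replace("\u2018", "'").replace("\u2019", "'")
  # drop hyphens, here by turning "man-cook" into "man cook"
  text = text.replace("-", " ")
  # put spaces around punctuation so "way." tokenizes as "way" and "."
  text = re.sub(r"([.,;:!?'\"()])", r" \1 ", text)
  # collapse line breaks and runs of whitespace into single spaces
  return re.sub(r"\s+", " ", text).strip()

print(clean_text("ONE\n\nChapter 1\n\n\nHappy families are all alike;").split())
# ['one', 'chapter', '1', 'happy', 'families', 'are', 'all', 'alike', ';']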

At this point I have decided to start building the classes I will be using in preparing the text for training. And generating the inputs for training. Based on an example, I am going to start with two classes: Vocabulary and Corpus. But that will be for another day.

Done

I believe this post is plenty long enough. So rather than continue with coding of the classes, I am going to consider it finished.

May your naming conventions cause you less grief than mine did.

Resources

A bit of overkill, but, for now, I wanted ready access to these tutorials/posts without running another search.