We have gone through the fairly elaborate data preparation, which included cleaning the raw textual data, creating a vocabulary, numerically tokenizing the cleaned data, and generating a dataset suitable for training the model. It took me two blog posts to cover all of that.
Time to move on to defining our RNN model.
Recurrent Neural Network
As mentioned in the first post in this series, I am going to use the LSTM variant of an RNN. To that end we will be using PyTorch’s LSTM layer. Well, a few of them in fact. But we are also going to add a new layer to our model: an embedding layer.
In the early days, tokens were converted into numbers by one-hot encoding them before feeding them to the NLP model. So, in our case, with some 14,000 unique tokens, each token would be converted into a vector of 14,000 columns. And since we are feeding 100 tokens at a time to the model, each input would be an array of 100 by 14,000 cells. This large dimensionality would require the model to have a suitably large number of parameters. Not horribly efficient. In addition, one-hot encoding does not capture the context or relationships between tokens.
LSTMs and most modern NLP models use token embeddings instead. Rather than huge one-hot vectors, an embedding uses lower dimensional dense vectors, e.g. 256 columns. These individual embeddings live in a multi-dimensional embedding space. This also allows the embedding layer to capture relationships, because tokens with similar meanings end up close to each other in the embedding space. The embedding layer will learn the word embeddings during model training.
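To make the size difference concrete, here is a quick, throwaway example. The vocabulary size matches our corpus, but the 256-column embedding is just the example dimension mentioned above, not necessarily what the final model will use.

import torch
import torch.nn as nn

# toy example: our ~14,000 token vocabulary, 256-dimensional embeddings
embed = nn.Embedding(num_embeddings=14224, embedding_dim=256)
tokens = torch.randint(0, 14224, (1, 100))   # a batch of one 100-token sequence
vectors = embed(tokens)
print(vectors.shape)   # torch.Size([1, 100, 256]) versus [1, 100, 14224] if one-hot encoded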
So our model will basically consist of an embedding layer, some number of LSTM layers, and, finally, a classification layer. There will also be dropout layers to provide some regularization.
Because I don’t currently expect to use this model again (will likely look at Transformers next), I am going to put the model class in a module in the project directory rather than the shared directory.
lstm.py
Guess you can figure out the module name.
Now, with an LSTM model the hidden state will in fact have two components.
- Hidden State: a representation of the previous inputs, retaining information from one time step to the next (short-term memory).
- Cell State: the long-term memory of the model; the three gates mentioned in a previous post control the content and output of the cell state.
Not that we have to worry too much about their presence; the PyTorch LSTM layer(s) will look after that. But we will be initializing the hidden state at the start of each epoch, and our code will need to account for both components.
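Here’s a standalone sketch, separate from the project code, just to show the two pieces nn.LSTM hands back (the sizes mirror the defaults I will be using below).

import torch
import torch.nn as nn

# standalone sketch: a 3-layer LSTM, hidden size 128, batch_first inputs
lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=3, batch_first=True)
x = torch.randn(32, 100, 128)      # (batch size, sequence length, embedding dimension)
out, (h_n, c_n) = lstm(x)          # hidden and cell states default to zeros if not supplied
print(out.shape)   # torch.Size([32, 100, 128]) -> output for every time step
print(h_n.shape)   # torch.Size([3, 32, 128])   -> hidden state per layer (short-term memory)
print(c_n.shape)   # torch.Size([3, 32, 128])   -> cell state per layer (long-term memory)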
I am not going to provide any real explanation for the class’ code. Most of it, other than the new layers, will be familiar to anyone who has worked through the previous projects.
Well, except for the last method, dtch_hdn. During training we need to explicitly ensure PyTorch keeps the hidden states from different sequences independent of one another. Since we will be detaching them on each iteration during training, that seemed like the perfect thing to put into a method (function).
# lstm.py - module to hold class(es) related to creating the LSTM model
# and training it (as I start this I currently think the latter)
import torch
import torch.nn as nn

import sys
sys.path.append('../shared_mods')
import config as cfg  # type: ignore


class LSTM(nn.Module):
    def __init__(self, vcb_sz, mbd_dim=128, hdn_dim=128, n_lyrs=3, do_rt=0.2):
        super().__init__()
        self.vcb_sz = vcb_sz
        self.mbd_dim = mbd_dim
        self.hdn_dim = hdn_dim
        self.n_lyrs = n_lyrs
        self.do_rt = do_rt
        self.embed = nn.Embedding(vcb_sz, mbd_dim)
        self.lstm = nn.LSTM(input_size=self.mbd_dim, hidden_size=self.hdn_dim,
                            num_layers=self.n_lyrs, dropout=self.do_rt, batch_first=True)
        self.d_out = nn.Dropout(do_rt)
        self.fc = nn.Linear(self.hdn_dim, self.vcb_sz)

    def forward(self, x, hdn):
        mbdg = self.d_out(self.embed(x))
        x, hdn = self.lstm(mbdg, hdn)
        x = self.d_out(x)
        x = self.fc(x)  # prediction
        return x, hdn

    def init_hdn(self, b_sz):
        # fresh all-zero hidden and cell states, one per layer per batch item
        hdn = torch.zeros(self.n_lyrs, b_sz, self.hdn_dim).to(cfg.device)
        cell = torch.zeros(self.n_lyrs, b_sz, self.hdn_dim).to(cfg.device)
        return hdn, cell

    def dtch_hdn(self, hddn):
        # detach both components so gradients don't flow back across sequences
        hdn, cell = hddn
        hdn = hdn.detach()
        cell = cell.detach()
        return hdn, cell
And a quick test.
... ...
if __name__ == "__main__":
    from pathlib import Path
    from d_tokenize import Corpus, Vocabulary
... ...
    fl_nm = "pg1399.txt"
    d_dir = Path("./data")
    corpus = Corpus(d_dir, fl_nm)

    b_sz = 32
    vcb_sz = len(corpus.vocab)
    mbd_dim = 128
    hdn_dim = 128
    n_lyrs = 3
    do_rt = 0.2

    lstm = LSTM(vcb_sz=vcb_sz, mbd_dim=mbd_dim, hdn_dim=hdn_dim, n_lyrs=n_lyrs, do_rt=do_rt)
    num_params = sum(p.numel() for p in lstm.parameters() if p.requires_grad)
    print(f'The model has {num_params:,} trainable parameters')
    print(lstm)
And the terminal output.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python lstm.py
The model has 4,051,856 trainable parameters
LSTM(
  (embed): Embedding(14224, 128)
  (lstm): LSTM(128, 128, num_layers=3, batch_first=True, dropout=0.2)
  (d_out): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=128, out_features=14224, bias=True)
)
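If you are curious where that parameter count comes from, a quick back-of-the-envelope check in the Python REPL adds up to the same number: the embedding table, three LSTM layers (each with input weights, recurrent weights and two bias vectors) and the final linear layer.

>>> emb = 14224 * 128                               # embedding table
>>> lstm_lyr = 4 * 128 * (128 + 128) + 2 * 4 * 128  # one LSTM layer: W_ih, W_hh plus the two biases
>>> fc = 128 * 14224 + 14224                        # final linear layer: weights + bias
>>> emb + 3 * lstm_lyr + fc
4051856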
I wanted to have a look at the output from the forward function. So I am going to feed a 100-token sequence to the model and examine the outputs. Perhaps only the shapes, but possibly the content. Though for one sequence I expect the content will be pretty meaningless.
Unfortunately, I must admit it took me some time to get all the inputs in the correct number of dimensions.
... ...
    hdn = (torch.zeros(3, 1, 128).to(cfg.device), torch.zeros(3, 1, 128).to(cfg.device))
    print(f"hdn: {type(hdn)}, hdn[0]: {type(hdn[0])}: {len(hdn[0])}, hdn[1]: {type(hdn[1])}: {len(hdn[1])}")

    # the size -1 is inferred from other dimensions
    x = corpus.trn[:100].view(1, -1).to(cfg.device)
    print(f"x: {type(x)} {x.shape} ->\n\tx[-10:]: {x[0][-10:]}")

    pred, (hs, hc) = lstm(x, hdn)
    print(f"pred: {pred.shape}, hs: {hs.shape}, hc: {hc.shape}")
    pred = pred.transpose(1, 2)
    print(f"pred.transpose: {pred.shape}, pred[0][0]: {pred[0][0].shape},\n\t{pred[0][0][-10:]}")
And in the terminal I got the following.
hdn: <class 'tuple'>, hdn[0]: <class 'torch.Tensor'>: 3, hdn[1]: <class 'torch.Tensor'>: 3
x: <class 'torch.Tensor'> torch.Size([1, 100]) ->
x[-10:]: tensor([ 2, 157, 3, 127, 562, 0, 25, 31, 2, 1616],
device='cuda:1')
pred: torch.Size([1, 100, 14224]), hs: torch.Size([3, 1, 128]), hc: torch.Size([3, 1, 128])
pred.transpose: torch.Size([1, 14224, 100]), pred[0][0]: torch.Size([100]),
pred[0][0][-10:]: tensor([-0.0448, -0.0481, -0.0406, -0.0175, -0.0255, -0.0469, -0.0136, -0.0300,
-0.0519, -0.0784], device='cuda:1', grad_fn=<SliceBackward0>)
Not sure what those negative values are indicating/selecting. Presumably they are just the raw, untrained logits for the first vocabulary token at each of the 100 positions, but…
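Just to close the loop on what those logits could be used for, here’s a quick sketch (picking up the transposed pred from the snippet above, but not something I have added to the test code) of turning the logits for the last position into a predicted token id. With an untrained model the prediction is of course meaningless.

# argmax over the vocabulary dimension for the last position in the sequence
# (after the transpose, pred has shape (1, vocab_size, seq_len))
logits_last = pred[0, :, -1]               # shape: torch.Size([14224])
probs = torch.softmax(logits_last, dim=0)  # probabilities over the vocabulary
print(int(probs.argmax()))                 # id of the most likely next token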
Training Loop
Before calling this post done, I am going to code the training loop for one epoch of training. I am going to put it in a function in the main module, nlp.py. I was looking at adding it as a method to the LSTM class, but am not really certain that would do anything other than complicate things. That said, I may yet try to do so.
For now the function is in the main module. Though much of the function has been seen before, there are some new items.
As mentioned earlier, we are detaching the hidden state on each iteration. This keeps the hidden state values but cuts the computation graph, so gradients are not back-propagated through every previous batch and the token sequences are treated independently. But we are also using gradient clipping. The primary reason for this is to prevent exploding gradients, which are quite common in deep RNN models.
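As a tiny, standalone illustration of what clip_grad_norm_ does (nothing to do with the project code): it rescales the gradients in place so their combined norm is at most the value passed in, and returns the pre-clipping norm.

import torch

w = torch.nn.Parameter(torch.ones(3))
loss = (100 * w).sum()          # contrived loss with deliberately large gradients
loss.backward()
print(w.grad)                   # tensor([100., 100., 100.]), norm ~173.2
total = torch.nn.utils.clip_grad_norm_([w], max_norm=5.0)
print(total)                    # the pre-clipping norm, tensor(173.2051)
print(w.grad)                   # rescaled in place so the norm is now 5.0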
I have also added some unnecessary bits and pieces to allow me to test the function without going through a full epoch of training. And for now I am not using tqdm to track training progress on screen.
That said, here it is. (Well, I have at a later date modified it a little. During testing I wanted some output to see that it was in fact running, so I added the verbose parameter and used it to control the printing of the running loss.)
def do_epoch(ep, mdl, dldr, optr, f_loss, b_sz, clip, d_tst=False, verbose=False):
    mdl.train()
    losses = []
    hh, hc = mdl.init_hdn(b_sz)
    prv_iter = ep * len(dldr)
    for i, (x, y) in enumerate(dldr):
        # only process batch if it is the correct size
        if x.shape[0] == b_sz:
            inps, tgts = x.to(cfg.device), y.to(cfg.device)
            optr.zero_grad()
            # here (see below)
            hh, hc = mdl.dtch_hdn((hh, hc))
            pred, (hh, hc) = mdl(inps, (hh, hc))
            c_loss = f_loss(pred.transpose(1, 2), tgts)
            # or here
            # hh, hc = mdl.dtch_hdn((hh, hc))
            c_loss.backward()
            # clip gradients to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(mdl.parameters(), clip)
            optr.step()
            if i > 0 and (prv_iter + i) % cfg.sv_chk_cyc == 0:
                losses.append(c_loss.item())
                if verbose:
                    print(f"epoch: {ep + 1}, iteration: {i} -> loss: {losses[-1]}")
            if d_tst and i > d_tst:
                return losses
    return losses
Okay, now let’s start on the actual training loop, and add a small test of the model code. I had a couple of bugs along the way; I won’t bother explaining them (mostly typos).
... ...
from lstm import LSTM
... ...
trn_model = True
tst_model = True
... ...
vcb_sz = len(corpus.vocab)
mbd_dim = 128
hdn_dim = 128
n_lyrs = 3
do_rt = 0.2
lr = 1e-4

# instantiate model
lstm = LSTM(vcb_sz=vcb_sz, mbd_dim=mbd_dim, hdn_dim=hdn_dim, n_lyrs=n_lyrs, do_rt=do_rt).to(cfg.device)

if trn_model:
    # set up cost function, optimizer, etc.
    optr = optim.Adam(lstm.parameters(), lr=lr)
    f_loss = nn.CrossEntropyLoss()
    if tst_model:
        d_tst = 3000
    else:
        d_tst = False
    st_tm = time.perf_counter()
    e_losses = do_epoch(0, lstm, tk_ldr, optr, f_loss, cfg.batch_sz, 5, d_tst=d_tst, verbose=True)
    nd_tm = time.perf_counter()
    if tst_model:
        print(e_losses)
        print(f"time to process {d_tst} iterations: {nd_tm - st_tm}")
And the output.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32
{'run_nm': 'rk1', 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 5, 'batch_sz': 32, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rk1_img & runs\rk1_sv
took 3.3997363000016776 to generate pairs
epoch: 1, iteration: 100 -> loss: 6.298485279083252
epoch: 1, iteration: 200 -> loss: 6.191473484039307
epoch: 1, iteration: 300 -> loss: 6.034035682678223
[6.298485279083252, 6.191473484039307, 6.034035682678223]
time to process 3000 iterations: 113.18612420000136
Now a full epoch will have approximately 429300 // 32 iterations. And at 113 secs for every 3000 iterations we are talking ((429300 // 32) / 3000) * 113 seconds per epoch. And…
>>> (429300 // 32)
13415
>>> ((429300 // 32) / 3000)
4.471666666666667
>>> ((429300 // 32) / 3000) * 113
505.29833333333335
>>> ((429300 // 32) / 3000) * 113 / 60
8.42163888888889
>>> ((429300 // 32) / 3000) * 113 / 60 * 50
421.08194444444445
>>>
For 50 epochs we are talking something like 7 hours of running time. Don’t know if I want to run the beast for that long in one stretch.
This One Done
Think that’s it for this post. I will finish coding the training loop next post. That will include saving the model and optimizer state dictionaries, logging the losses on each epoch and likely plotting the losses for each training run. I will also set things up to allow the resumption of training after some previous run.
May your time at the keyboard be productive and satisfying. Mine, surprisingly, has been of late.
Resources
- torch.nn.LSTM()
- torch.nn.Embedding()
- What is nn.Embedding really?
- Explaining Embedding layer in Pytorch
- What is the purpose of the cell state in LSTM?
- What’s the difference between “hidden” and “output” in PyTorch LSTM?
- Understanding Gradient Clipping