Virtually every example and tutorial I have looked at uses a class to define the model. That class is instantiated when the model is ready to be used. And, at least one example I saw saved the class in the file to which the model was being saved.

I was also thinking that I am saving the final model following the early stop: the last of some number of models that did not improve for some number of epochs (based on validation set tests). Perhaps I should be saving the one that generated the minimum validation loss before the early stop?

So, let’s see if I can get a class definition to work for the apparel multi-category classifier developed in the last post. And, regardless of that outcome, determine which model I should likely be saving to file.
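Before getting into that, here is a minimal sketch (not code from this or the previous post; the tiny model and file names are made up purely for illustration) of the two ways a PyTorch model typically gets written to disk: pickling the entire model object, which requires the class definition to be available when the file is loaded, versus saving only the state_dict and loading it into a freshly instantiated model.

import torch
from torch import nn

# stand-in model, just for illustration
class TinyModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc = nn.Linear(4, 2)

  def forward(self, x):
    return self.fc(x)

t_model = TinyModel()

# option 1: pickle the entire model object
# (the class definition must be importable when the file is loaded,
#  and newer PyTorch versions may also need weights_only=False here)
torch.save(t_model, "tiny_whole.pth")
m1 = torch.load("tiny_whole.pth")

# option 2: save only the learned parameters,
# then load them into a freshly instantiated model
torch.save(t_model.state_dict(), "tiny_state.pth")
m2 = TinyModel()
m2.load_state_dict(torch.load("tiny_state.pth"))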

Using a Class to Define the Model

There are basically two requirements for this class: an __init__ method to define the network/model layers, and a forward method to actually execute the model's forward pass on each batch of input. The forward pass will include all the activation functions, which may or may not have been defined in the __init__ method.

Code

In the forward method, I am applying the ReLU activation function to the output of the linear layers instantiated in the __init__ method. I will be passing the value for the Dropout probability to the __init__ method rather than hardcoding something. I will also take the opportunity to reshape the input tensor in forward so that I don’t need to do that beforehand.

A good majority of the code in this module is the same as that in the previous post. So, I am going to only show the significant differences.

# multi-cat_2.py
# Ver 0.1.0: 2024.03.19, rek, get started figuring this out
#  - train multi-category classification model for Fashion-MNIST dataset
#     use class to define and instantiate model (best practice?)
#     a lot of code copied from multi-cat.py

... ...
# rename file accordingly
fl_pth = f"mc2_checkpt_{int(do_p*100)}_{e_stop}.pth"
... ...
# define model 
class MCClassifier(nn.Module):
  def __init__(self, do_p):
    super().__init__()
    input_size = 28 * 28
    self.mc1 = nn.Linear(input_size, 256)
    self.mc2 = nn.Linear(256, 128)
    self.mc3 = nn.Linear(128, 64)
    self.mc4 = nn.Linear(64, 10)
    self.dropout = nn.Dropout(p=do_p)
  
  def forward(self, x):
    # x = x.view(x.shape[0], -1)
    x = x.view(-1, 28*28)
    x = F.relu(self.mc1(x))
    x = F.relu(self.mc2(x))
    x = F.relu(self.mc3(x))
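    # note: dropout is applied to the output of the final (4th) linear layer here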
    x = self.dropout(self.mc4(x))
    x = F.softmax(x, dim=1)
    return x


mc_model = MCClassifier(do_p=do_p)
mc_model = mc_model.to(device)
... ...
# function to execute single training epoch on current model state
def ex_epoch(mc_model, optimizer, loss_fn, train_loader):
  mc_model.train()
  trn_loss = 0
  for imgs, lbls in train_loader:
    optimizer.zero_grad()
    imgs = imgs.to(device)
    lbls = lbls.reshape(-1, ).to(device)
    # calling the model directly invokes forward() via nn.Module.__call__
    preds = mc_model(imgs)
    loss = loss_fn(preds, lbls)
    loss.backward()
    optimizer.step()
    # accumulate a plain float rather than a tensor carrying autograd history
    trn_loss += loss.item()
  # average training loss over the number of batches
  return trn_loss / len(train_loader)


# function to calculate average loss of current model against validation set
def get_val_loss(mc_model, val_loader):
  mc_model.eval()
  v_loss = 0
  for imgs, lbls in val_loader:
    # send images to the gpu (flattening now happens in forward())
    imgs = imgs.to(device)
    # ditto labels
    lbls = lbls.reshape(-1, ).to(device)
    preds = mc_model(imgs)
    loss = loss_fn(preds, lbls)
    v_loss += loss.item()
  # average validation loss over the number of batches
  return v_loss / len(val_loader)


if not ld_model:
... ...
else:
  # load checkpoint from file and restore the model and optimizer state
  chk_pt = torch.load(fl_pth)
  mc_model = chk_pt['model']
  mc_model.load_state_dict(chk_pt['state_dict'])
  optimizer.load_state_dict(chk_pt['optimizer'])
  trn_loss = chk_pt['trn_loss']

# test the model
mc_model.eval()
mc_model.to(device)
st_tm = time.perf_counter()
results = []
for imgs, lbls in test_loader:
  imgs = imgs.to(device)
  lbls = lbls.reshape(-1, ).to(device)
  preds = mc_model(imgs)
  # get index of category with max probability
  pred_l = torch.argmax(preds, dim=1)
  correct = (pred_l == lbls)
  results.append(correct.detach().cpu().numpy().mean())
... ...
# save model for later reload, rather than training again
if sv_model:
  checkpt = {
    'model': MCClassifier(do_p),
    'state_dict': mc_model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': 31,
    'trn_loss': trn_loss
  }
  torch.save(checkpt, fl_pth)
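One thing I am not doing in the module above, but that is a common PyTorch habit, is running the validation (and test) passes without gradient tracking. A minimal sketch of what get_val_loss would look like with a torch.no_grad() context added, everything else unchanged and still relying on the device and loss_fn already set up in the module:

# same as get_val_loss above, but without building an autograd graph
def get_val_loss(mc_model, val_loader):
  mc_model.eval()
  v_loss = 0
  # validation never calls backward(), so skip gradient tracking entirely
  with torch.no_grad():
    for imgs, lbls in val_loader:
      imgs = imgs.to(device)
      lbls = lbls.reshape(-1, ).to(device)
      preds = mc_model(imgs)
      loss = loss_fn(preds, lbls)
      v_loss += loss.item()
  return v_loss / len(val_loader)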

do_p = 0.16, e_stop = 9, Dropout on 4th layer

(mclp-3.12) PS F:\learn\mcl_pytorch\chap2> python multi-cat_2.py
epoch 1: training loss = 1.7686419486999512, validation loss = 1.6443599462509155
epoch 2: training loss = 1.7082880735397339, validation loss = 1.6643993854522705
epoch 3: training loss = 1.702735424041748, validation loss = 1.6403030157089233
epoch 4: training loss = 1.6923450231552124, validation loss = 1.6497535705566406
... ...
epoch 22: training loss = 1.674271821975708, validation loss = 1.6110631227493286
epoch 23: training loss = 1.6724767684936523, validation loss = 1.6197649240493774
epoch 24: training loss = 1.6720017194747925, validation loss = 1.612855315208435
epoch 25: training loss = 1.67617928981781, validation loss = 1.644824743270874
epoch 26: training loss = 1.6707725524902344, validation loss = 1.6334954500198364
epoch 27: training loss = 1.6689926385879517, validation loss = 1.6161855459213257
epoch 28: training loss = 1.6689773797988892, validation loss = 1.624940276145935
epoch 29: training loss = 1.6749593019485474, validation loss = 1.628265380859375
epoch 30: training loss = 1.6751453876495361, validation loss = 1.6204700469970703
epoch 31: training loss = 1.6708028316497803, validation loss = 1.6307770013809204

time to train model: 299.7599093999597

prediction accuracy is 0.8333996815286624

time to test model: 1.6481092000030912

I really thought I’d end up with the same result obtained at the end of the last post. I am using the same seed for the random number generator, and the same values for the number of neurons in each layer, the Dropout probability and the number of epochs for early stop.

I simply do not understand why the result is different.

And, when I load the saved model and run it against the test set, I get a different accuracy value.

(mclp-3.12) PS F:\learn\mcl_pytorch\chap2> python multi-cat_2.py

prediction accuracy is 0.8348925159235668

time to test model: 2.0502081000013277

If I load the model from the file or instantiate the model from the class for prediction only, I get the same accuracy on the test set as the above for each execution of the module.

If I load the model parameters from the file, but use the model instantiated in the code rather than the one stored in the file, for prediction only, I get the same accuracy on the test set as above for each execution.

If I retrain the model from scratch (same code), I get the same accuracy on the test set as that following the training session shown earlier.

That the two are different really does not make sense to me. At this point I assume I will likely never know why. Just don’t understand the underlying math well enough to sort it out.
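For what it is worth, a fixed seed on its own does not guarantee bit-identical runs, particularly on a GPU. Below is a sketch of the extra determinism knobs PyTorch exposes; the seed value is arbitrary, and I have not verified that this removes the difference for this particular module, but it is the usual starting point.

import os
import torch

seed = 73   # arbitrary value, just for illustration

torch.manual_seed(seed)                       # seed the PyTorch random number generators
torch.cuda.manual_seed_all(seed)              # seed every visible GPU
torch.backends.cudnn.deterministic = True     # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False        # disable kernel auto-tuning
# some CUDA ops also need this workspace setting to be deterministic
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# raise an error if an op has no deterministic implementation
torch.use_deterministic_algorithms(True)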

Save Last Best Checkpoint

Okay, enough worrying about something I have no idea how to fix. (Did post issue on book’s forum. Will see if anyone provides an answer).

What I want to do now is look at saving the model that had the best result on the validation set, rather than the last one, which came some number of epochs later (all of which got poorer results on the validation set). Then I will compare the result on the test set for the two models.

Was having some issues during development, so moved a few things around and added/modified terminal output.

Here are the code changes/additions. Basically the EarlyStop class, the training loop and the save-to-file block were refactored. Seems to work.

# modified to output a second value, telling me when to save checkpoint
# early stop class
class EarlyStop:
  def __init__(self, e_wait=8):
    self.e_wait = e_wait   # stop if notbttr >= wait
    self.notbttr = 0    # number of epochs without improvement
    # last best loss, start with a big number since we are looking for the minimum
    self.min_loss = float('inf')
  
  # vs_loss: current value of validation set average loss
  def is_stop(self, vs_loss):
    if vs_loss < self.min_loss:
      self.min_loss = vs_loss
      self.notbttr = 0
    else:
      self.notbttr += 1
    return ((self.notbttr >= self.e_wait), self.notbttr)
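# (not shown above) the EarlyStop instance used in the training loop below is
# presumably created once beforehand, something along the lines of:
#   chk_stop = EarlyStop(e_wait=e_stop)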

# in train loop add code to save checkpoint if early stop counter reset to 0
tmp_chkpt = {}

if not ld_model:
  # okay let's train the model, will start with 100 iterations
  # hopefully the early stop via validation set will kick in sooner
  st_tm = time.perf_counter()
  for i in range(max_ep):
    trn_loss = ex_epoch(mc_model, optimizer, loss_fn, train_loader)
    val_loss = get_val_loss(mc_model, val_loader)
    c_stop, is_reset = chk_stop.is_stop(val_loss)
    print(f"epoch {i + 1}: training loss = {trn_loss}, validation loss = {val_loss} -> stop: {c_stop}, reset: {is_reset}")
    if is_reset == 0:
      tmp_chkpt = {
        # 'model': MCClassifier(do_p),
        'state_dict': mc_model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': i + 1,
        'trn_loss': trn_loss,
        'val_loss': val_loss
      }
    if c_stop:
      break
  nd_tm = time.perf_counter()
  print(f"\ntime to train model: {nd_tm - st_tm}")
else:
... ...

# in save to file loop add code to save the checkpoint
# save model for later reload, rather than training again
if sv_model:
  checkpt = {
    'model': mc_model,
    'state_dict': mc_model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': i + 1,
    'trn_loss': trn_loss,
    'val_loss': val_loss
  }
  torch.save(checkpt, fl_pth)
  tchk_fl = f"chkpt_{tmp_chkpt['epoch']}_{int(do_p*100)}_{e_stop}.pth"
  tmp_chkpt['model'] = mc_model
  torch.save(tmp_chkpt, tchk_fl)

So now I am saving both the best model and the very last model following the early stop. After adding the code to save the last best model, I reconfigured the module to load the data and do the training.

do_p = 0.16, e_stop = 9, Dropout on output layer

(mclp-3.12) PS F:\learn\mcl_pytorch\chap2> python multi-cat_2.py

device: cuda, do_p: 0.16, e_stop: 9
epoch 1: training loss = 1.7686419486999512, validation loss = 1.6443599462509155 -> stop: False, reset: 0
epoch 2: training loss = 1.7082880735397339, validation loss = 1.6643993854522705 -> stop: False, reset: 1
epoch 3: training loss = 1.702735424041748, validation loss = 1.6403030157089233 -> stop: False, reset: 0
epoch 4: training loss = 1.6923450231552124, validation loss = 1.6497535705566406 -> stop: False, reset: 1
epoch 5: training loss = 1.6864745616912842, validation loss = 1.6276779174804688 -> stop: False, reset: 0
... ...
epoch 22: training loss = 1.674271821975708, validation loss = 1.6110631227493286 -> stop: False, reset: 0
epoch 23: training loss = 1.6724767684936523, validation loss = 1.6197649240493774 -> stop: False, reset: 1
epoch 24: training loss = 1.6720017194747925, validation loss = 1.612855315208435 -> stop: False, reset: 2
epoch 25: training loss = 1.67617928981781, validation loss = 1.644824743270874 -> stop: False, reset: 3
epoch 26: training loss = 1.6707725524902344, validation loss = 1.6334954500198364 -> stop: False, reset: 4
epoch 27: training loss = 1.6689926385879517, validation loss = 1.6161855459213257 -> stop: False, reset: 5
epoch 28: training loss = 1.6689773797988892, validation loss = 1.624940276145935 -> stop: False, reset: 6
epoch 29: training loss = 1.6749593019485474, validation loss = 1.628265380859375 -> stop: False, reset: 7
epoch 30: training loss = 1.6751453876495361, validation loss = 1.6204700469970703 -> stop: False, reset: 8
epoch 31: training loss = 1.6708028316497803, validation loss = 1.6307770013809204 -> stop: True, reset: 9

time to train model: 304.1708910000161

prediction accuracy is 0.8333996815286624

time to test model: 1.6345650000148453

Compare the Two Saved Models

I need to reset some variables and load the saved models. And refactor the load block to load both models from file if appropriate (new/refactored variables for controlling each load and use).

Here are the portions of the code that I altered.

# refactor/add a variable or three
# model from/to file
ld_model = True
sv_model = [False, False]
tst_model = [True, True]
tchk_fl = "chkpt_22_16_9.pth"

# refactor loading of files, setting model parameters,
# training, testing and saving of models
if not ld_model:
... ...
else:
  # load one or both models from file
  if tst_model[0]:
    chk_pt = torch.load(fl_pth)
  if tst_model[1]:
    e_chk_pt = torch.load(tchk_fl)

# test the model(s)
for i in range(2):
  do_test = True
  if ld_model:
    do_test = False
    # first pass tests the final model, second pass the early (best) checkpoint
    if i == 0 and tst_model[0]:
      mc_model = chk_pt['model']
      mc_model.load_state_dict(chk_pt['state_dict'])
      optimizer.load_state_dict(chk_pt['optimizer'])
      do_test = True
    elif i == 1 and tst_model[1]:
      mc_model = e_chk_pt['model']
      mc_model.load_state_dict(e_chk_pt['state_dict'])
      optimizer.load_state_dict(e_chk_pt['optimizer'])
      do_test = True
    else:
      continue
  else:
    # train with just trained model
    ...
  
  if do_test:
    mc_model.to(device)
    mc_model.eval()
    st_tm = time.perf_counter()
    results = []
    for imgs, lbls in test_loader:
      imgs = imgs.to(device)
      lbls = lbls.reshape(-1, ).to(device)
      preds = mc_model(imgs)
      # get index of category with max probability
      pred_l = torch.argmax(preds, dim=1)
      correct = (pred_l == lbls)
      results.append(correct.detach().cpu().numpy().mean())

    accuracy = np.array(results).mean()
    print(f"\n{'final model' if i == 0 else 'early checkpoint'} prediction accuracy is {accuracy}")
    nd_tm = time.perf_counter()
    print(f"\ntime to test model: {nd_tm - st_tm}")

  # if testing just trained model, do not do a second test
  if not ld_model:
    break

# save model(s) for later reload, rather than training again
if sv_model[0]:
  checkpt = {
    'model': mc_model,
    'state_dict': mc_model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': i + 1,
    'trn_loss': trn_loss,
    'val_loss': val_loss
  }
  torch.save(checkpt, fl_pth)
if sv_model[1]:
  tchk_fl = f"chkpt_{tmp_chkpt['epoch']}_{int(do_p*100)}_{e_stop}.pth"
  tmp_chkpt['model'] = mc_model
  torch.save(tmp_chkpt, tchk_fl)

(mclp-3.12) PS F:\learn\mcl_pytorch\chap2> python multi-cat_2.py

device: cuda, do_p: 0.16, e_stop: 9

final model prediction accuracy is 0.8348925159235668

time to test model: 2.1021414999850094

early checkpoint prediction accuracy is 0.8342953821656051

time to test model: 1.3659323999891058

I was sure the early checkpoint model would perform better. Though we are really looking at a pretty small difference.

Done

At the moment I think this post has gone as far as it can. I had thought about trying a number of different values for the Dropout probability and the early stop epochs. But I am sufficiently confused about what is happening that I think it is time for me to stop playing around with this project.

Until we meet again, may confusion not mess with your world the way it has mine this past day or two.

Postscript

And here’s the response I got from the author on the book’s forum.

The small difference is due to how floating numbers are handled by PyTorch. The torch.manual_seed() method fixes the random state so results are the same when you rerun your programs. However, you may get different results still each time even if you use the same random seed. See, e.g., the explanations here https://discuss.pytorch.org/t/different-training-results-on-different-machines-with-simplified-testcode/59378/3. The difference is generally minor, though. So no need to be alarmed.

Mark Liu
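A quick aside to close this out: floating point addition is not associative, so summing the same values in a different order can give slightly different results. A tiny plain-Python example (nothing PyTorch specific) of the effect being described above:

# the same three numbers, summed in a different order
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False
print(a, b)     # 0.6000000000000001 0.6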