Okay, let’s code the training loop and see how the model performs.
Training the Autoencoder
For tidiness, I will put the code for executing a single epoch of training in its own function. The function will live in the main VAE module, vae.py. It will get the model output for each batch, determine the reconstruction loss and the KL divergence, calculate the total loss, and update the model.
Instantiate Model and Optimizer
But because the function will be using global variables for the VAE and the optimizer, let's get them instantiated first.
# instantiate model and optimizer
vae = VAE().to(cfg.device)
lr = 1e-4
opt = torch.optim.Adam(vae.parameters(), lr=lr, weight_decay=1e-5)
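A quick sanity check I find handy before training, counting the trainable parameters. A minimal sketch, not part of the module proper:

# quick sanity check: count the model's trainable parameters
n_params = sum(p.numel() for p in vae.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params:,}")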
Function to Run One Epoch of Training
Now, the loss calculation is different from previous models, but pretty much everything else in the function should be familiar, if not an exact duplicate of past training loops/functions.
# define function to perform a single epoch of training
def trn_ep(ep):
    vae.train()
    ep_loss = []
    for imgs, _ in tqdm(ldr, desc=f"epoch {ep + 1}"):
        imgs = imgs.to(cfg.device)
        mu, std, encdd = vae(imgs)
        # reconstruction loss: summed squared error between input and output
        rcn_loss = ((imgs - encdd)**2).sum()
        # closed-form KL divergence between N(mu, std) and the standard normal
        kl_dv = ((mu**2) / 2 + (std**2) / 2 - torch.log(std) - 0.5).sum()
        loss = rcn_loss + kl_dv
        opt.zero_grad()
        loss.backward()
        opt.step()
        ep_loss.append(loss.item())
    # print(imgs.shape, encdd.shape)
    # save a grid of training images and their regenerated counterparts
    img_rs = imgs.reshape(cfg.batch_sz, 3, 256, 256)
    rgn_rs = encdd.reshape(cfg.batch_sz, 3, 256, 256)
    if cfg.batch_sz <= 16:
        ipr = cfg.batch_sz // 2
        s_ndx, e_ndx = 0, cfg.batch_sz
    else:
        # pick a random window of 16 images from the larger batch
        s_ndx = int(cfg.rng.integers(0, cfg.batch_sz - 16, 1)[0])
        e_ndx = s_ndx + 16
        ipr = 8
    p_img = [*img_rs[s_ndx:s_ndx + ipr], *rgn_rs[s_ndx:s_ndx + ipr],
             *img_rs[s_ndx + ipr:e_ndx], *rgn_rs[s_ndx + ipr:e_ndx]]
    utl.image_grid(p_img, ipr, i_show=False, epoch=ep, b_sz=cfg.batch_sz, img_cl='real & regen')
    return ep_loss
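If you want to convince yourself that the KL term above matches the closed-form divergence between N(mu, std) and the standard normal, torch.distributions can check it numerically. A quick, throwaway sketch, not part of the module:

import torch
from torch.distributions import Normal, kl_divergence

# closed-form KL used in the training loss, on some random mu/std
mu, std = torch.randn(4, 8), torch.rand(4, 8) + 0.1
kl_manual = ((mu**2) / 2 + (std**2) / 2 - torch.log(std) - 0.5).sum()
# same quantity via torch.distributions
kl_torch = kl_divergence(Normal(mu, std), Normal(0.0, 1.0)).sum()
print(torch.allclose(kl_manual, kl_torch))  # True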
And the training loop is now pretty simple.
losses = []
for ep in range(0, cfg.epochs):
    e_losses = trn_ep(ep)
    losses.extend(e_losses)
I tried a quick one-epoch training run. It did not go well, nor did my attempts to quickly sort it out. Here's the meaningful part of the error message.
RuntimeError: running_mean should contain 15 elements not 16
One of many bugs/typos. See it?
self.dc_cnv = nn.Sequential(
    nn.ConvTranspose2d(32, 15, 3, stride=2, output_padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(True),
    nn.ConvTranspose2d(16, 8, 3, stride=2, padding=1, output_padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(True),
    nn.ConvTranspose2d(8, 3, 3, stride=2, padding=1, output_padding=1)
)
Fixed that.
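Specifically, the first transposed convolution needs 16 output channels to match the BatchNorm2d(16) and the 16-channel input of the next layer:

# first decoder layer corrected: 16 output channels, not 15
nn.ConvTranspose2d(32, 16, 3, stride=2, output_padding=1),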
Add Loss Logging and Model Checkpoint Saving
But of course I need to clutter up that simple loop by adding the code to log/plot the losses and save the model checkpoint at the end of the training loop.
if trn_model:
    loss_nms = ["vae"]
    lgr_loss = Logger(cfg.run_nm, cfg.sv_chk_cyc, loss_nms)
    losses = []
    for ep in range(0, cfg.epochs):
        e_losses = trn_ep(ep)
        losses.extend(e_losses)
        all_losses = {"vae": losses}
        lgr_loss.log_losses(all_losses)
    # save checkpoint at the end of the training run
    utl.sv_chkpt(cfg.run_nm, ep, vae, opt, None,
                 cfg.batch_sz, cfg.sv_dir/f"vae_{cfg.epochs}.pt", do_jit=True)
    # save losses to file and show plot
    lgr_loss.to_file(cfg.sv_dir, cfg.epochs, cfg.epochs*nbr_btch)
    lgr_loss.plot_losses()
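As an aside, here is a minimal sketch of reloading such a checkpoint later, assuming sv_chkpt wraps torch.save with a dict of state dicts (the key names below are assumptions; the actual utility may organize things differently):

# minimal reload sketch; assumes the usual torch.save dict-of-state-dicts convention
chkpt = torch.load(cfg.sv_dir/f"vae_{cfg.epochs}.pt", map_location=cfg.device)
vae.load_state_dict(chkpt["model"])      # key names are assumptions
opt.load_state_dict(chkpt["optimizer"])
vae.eval()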
The next test run also failed. I was also trying to save a TorchScript version of the model, which produced the following, along with a bunch of stuff I have not included here.
... ...
Module 'VAEncoder' has no attribute 'N' (This attribute exists on the Python module, but we failed to convert Python type: 'torch.distributions.normal.Normal' to a TorchScript type. Only tensors and (possibly nested) tuples of tensors, lists, or dicts are supported as inputs or outputs of traced functions, but instead got value of type Normal.. Its type was inferred; try adding a type annotation for the attribute.):
File "F:\learn\mcl_pytorch\proj7\../shared_mods\autoencoder.py", line 61
mu = self.ln2(x)
std = torch.exp(self.ln3(x))
z = mu + std*self.N.sample(mu.shape)
~~~~~~ <--- HERE
return mu, std, z
I may have found a solution online, but it is a little more complicated than I wish to tackle just now. So, I changed the call to include do_jit=False.
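For the record, the usual TorchScript-friendly workaround is to drop the Normal distribution attribute entirely and sample via torch.randn_like, which keeps the reparameterization trick while only touching tensors. A minimal sketch of the idea; the class name and constructor arguments here are my own placeholders, only ln2/ln3 come from the traceback:

import torch
import torch.nn as nn

class VAEncoderJit(nn.Module):
    """TorchScript-friendly variant: no torch.distributions attribute."""
    def __init__(self, in_f: int, z_dim: int):
        super().__init__()
        self.ln2 = nn.Linear(in_f, z_dim)  # mu head (name from the traceback)
        self.ln3 = nn.Linear(in_f, z_dim)  # log-std head

    def forward(self, x: torch.Tensor):
        mu = self.ln2(x)
        std = torch.exp(self.ln3(x))
        # randn_like keeps everything a tensor, so tracing/scripting works
        z = mu + std * torch.randn_like(std)
        return mu, std, z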
On the next test run, I had another bug that took me way too long to sort out. I originally had losses.append(e_losses), which made a total mess of the plot. That has been fixed in the code above.
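In case the distinction is not obvious: append adds the whole per-epoch list as a single nested element, while extend splices its items in, which is what the plotting code expects. A tiny illustration:

losses = [1.0, 2.0]
losses.append([3.0, 4.0])   # -> [1.0, 2.0, [3.0, 4.0]]  nested; breaks the plot
losses = [1.0, 2.0]
losses.extend([3.0, 4.0])   # -> [1.0, 2.0, 3.0, 4.0]    flat; what we want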
One Epoch Test Run
Okay, a “successful” test run. Note that the print statements producing some of the output below may not be in the code above; they were part of my debugging efforts.
(mclp-3.12) PS F:\learn\mcl_pytorch\proj7> python vae.py -rn rek_3 -bs 16 -ep 1
{'run_nm': 'rek_3', 'dataset_nm': 'no_nm', 'sv_img_cyc': 150, 'sv_chk_cyc': 50, 'resume': False, 'start_ep': 0, 'epochs': 1, 'batch_sz': 16, 'num_res_blks': 9, 'x_disc': 1, 'x_genr': 1, 'x_eps': 0, 'use_lrs': False, 'lrs_unit': 'batch', 'lrs_eps': 5, 'lrs_init': 0.01, 'lrs_steps': 25, 'lrs_wmup': 0}
image and checkpoint directories created: runs\rek_3_img & runs\rek_3_sv
batch size: 16, nbr batches: 278
logger: {'vae': []}
epoch 1: 100%|███████████████████████████████████████████████████████████████████████| 278/278 [02:49<00:00, 1.64it/s]
Here’s an image showing 16 training images and the regenerated images after that one epoch of training. I am sure you can tell which images are the regenerated ones, but even after one epoch of training they are clearly faces similar to the respective training images.

And a plot of the losses. Looks to be going in the right direction.

Done
I believe that’s enough for this one. I will look at doing an extended training run. Not exactly sure how many epochs; 5 at least, maybe 10.
Until next time, may you have fewer bugs and sort them much more quickly than I did.