There are a number of books that cover automating routine tasks with Python. But for this one I pretty much went solo, since I had some ideas about how to proceed. Well, solo in terms of the problem solution and its design; there was lots of help on-line with the bits and pieces.

The Problem

As I am sure I have mentioned, I use Hugo (a static site generator) to produce this blog. So, once a week or so, I compile the blog to get the new post for that week (or sometimes, that day if publishing more than once a week). When I was only using Netlify as my hosting platform, there was no problem. The netlify-cli pretty much looked after everything for me (well, once I set up the configuration file accordingly). And, of course, an appropriate config file for Hugo as well.

But a while back I decided I was also going to post the blog on a personal website hosted by one of the many ISPs out there; specifically HOSTINGER in this case. I still use FTP to move files to my website, currently via FileZilla. And, I typically only upload files that have changed since the last upload. Well, except the first time, when, of course, the whole thing needed uploading. And that took quite some time.

So after the first upload, on the next compile I went to upload the changed files. I typically sort the listing by the date each file was last updated. To my surprise, all the files in the “production” folder had the same date. Apparently Hugo does not perform incremental builds; when doing a production build, it compiles the whole site. I sure didn’t want to upload the whole site again. So, I made a guess about which files likely changed and only uploaded those. But…

Possible Solution

At this point I thought back to Netlify which definitely wasn’t uploading the whole blog each time. I recalled the mention of the term “hashing” in the netlify-cli output.

PS R:\hugo\tooOldCode> netlify deploy --prod
Deploy path:        █████████
Configuration path: █████████
Deploying to main site URL...
√ Finished hashing 472 files
√ CDN requesting 0 files and 0 functions
√ Finished uploading 0 assets
√ Deploy is live!

So, I figured netlify-cli was saving (likely on their servers, though possibly locally) a “hash” value for each file it uploaded. Then, the next time, it compared the hash values of all the blog files against the stored hash values. Any new files were uploaded and their hashes saved. Any files with a different hash value were uploaded and the stored value replaced with the new hash value.
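
My guess at that logic, as a minimal sketch. This is my own reconstruction, not netlify's actual code, and the function name and sample data are made up for illustration:

```python
def changed_files(stored, current):
    # Upload anything that is new, or whose hash differs from the stored one
    return [path for path, h in current.items() if stored.get(path) != h]

stored = {'index.html': 'aaa', 'about.html': 'bbb'}
current = {'index.html': 'aaa', 'about.html': 'ccc', 'new.html': 'ddd'}
print(changed_files(stored, current))  # ['about.html', 'new.html']
```

The whole trick is that one `stored.get(path) != h` test: a missing key and a changed hash both come out "changed".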

Bingo, I figured I could do something similar. But I would just have my script print a list of the changed files so I knew which ones to FTP to my site. Seemed simple enough, as I was already using a Python script to count some web links in a file I use to store links of interest. No need to go into why, but that worked for me. So hopefully this would as well.

First Attempt

So, in my main Hugo directory, I created a new directory called, strangely enough, scripts. And, within that, a new file for my script, hugo_changed_hash.py. As usual, Python has a built-in library, hashlib, that looks after my needs. Add some preamble/documentation, preliminary variables, etc. to the file and away we go.
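
For anyone who hasn't used hashlib, the basic pattern is tiny: create a hash object, feed it bytes with update(), read back a digest. A throwaway example (not from the script):

```python
import hashlib

h = hashlib.md5()
h.update(b'hello')    # feed the hash object some bytes
print(h.hexdigest())  # '5d41402abc4b2a76b9719d911017c592'
```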

I then began by sorting out the code to get a command line argument indicating whether or not to update the hash data file. And, to modify variables as appropriate. I initially planned to be able to change the hash function via the command line, but decided (as stated in the comments) that that really wasn’t going to be necessary. But the related variable(s) are still in the module.

"""
Script: r:/hugo/scripts/hugo_changed_hash.py

Script to determine which compiled Hugo files have probably changed following compile based on file hash values.
I currently check my personal site output directory. Was only going to check the netlify one, but there might
be situations where that would not be sufficient.

Going to use MD5 algorithm as not really worried about security here. Just looking for possibly changed files.

Hashes are saved in/written to a python module (need to manually create file with empty dictionary for the first run) that
is then imported by this one. There may be more than one such file if I start generating multiple blogs.
  - TooOldToCode: r:/hugo/scripts/to2c_k_hashes.py
The '_k_' indicates the data is for the files in the personal site directory. An '_n_' will indicate it is for the
'netlify' files. They will be different as I don't minify the ones on the personal site.

Arguments
----

-w [y|n] - if yes write updated hashes and list of changed files to the data module, defaults to 'n'

Module/global variables:
----

B_DIRS - dictionary of starting directory by blog
BUF_SIZE - default buffer size for hash_block()
D_HASH - default hashing protocol
DEF_BLOG - default blog if nothing specified on command line
wrt_file - write new hashes/changed files to appropriate data file, defaults to False
chg_files - list of new files or with a different hash value from that previously in the data file

Functions
----


"""
import hashlib
import pathlib
import argparse

# the compiled files for netlify are in 'ndocs'
# those for my personal site are in 'to2c_k_ca'
# Immediately convert the file paths into a pathlib path for later use
# when traversing the blog directory tree.
B_DIRS = {
  'to2c_n': pathlib.Path('r:/hugo/tooOldCode/ndocs'),
  'to2c_k': pathlib.Path('r:/hugo/tooOldCode/to2c_k_ca')
}
DEF_BLOG = 'to2c_k'

# An arbitrary (but fixed) buffer size (change accordingly)
# 65536 bytes = 64 kilobytes
BUF_SIZE = 65536

# for my purposes md5 plenty good enough, security isn't the issue
# Thought I could save some code by calling the hash function to initialize D_HASH for later use
D_HASH = hashlib.md5()

parser = argparse.ArgumentParser(description='Check Hugo compile for changed files')
parser.add_argument('--wrt_file', '-w', help='Write hashes to data file?', choices=['n', 'y'], default='n')

f_hashes = None
if DEF_BLOG == 'to2c_n':
  from to2c_n_hashes import f_hashes
elif DEF_BLOG == 'to2c_k':
  from to2c_k_hashes import f_hashes

# default hash data file name
file_nm = f"{DEF_BLOG}_hashes.py"
wrt_file = False
chg_files = []

args = parser.parse_args()
if args.wrt_file == 'y':
  wrt_file = True

b_path = B_DIRS[DEF_BLOG]

Then I needed to create an initial to2c_k_hashes.py module so it could be imported when I ran the script. Note the trailing blank line: I decided to append to the file rather than overwrite it because I wanted, especially during development, a history (sort of a backup) of the hash values. I will manually prune the file whenever I feel it is necessary. So, a trailing blank to separate the next set of data.

f_hashes = {}

And, the above doesn’t do anything, but a quick test also didn’t generate any error messages.

Now, let’s add a function to generate the MD5 hash of a file, hash_file(file, f_hash). It is passed the path of the file to process and the hash function (contained in our variable D_HASH; remember I had initially planned to be able to change the hash function should I so desire). And, let’s add some code to test it with a couple of files. I am initially reading/hashing the whole file in one gulp. I will eventually change that so that larger files are processed in BUF_SIZE chunks. That is more memory efficient, and one source seemed to indicate it is faster for larger files.

So, just after the python library imports, I added the following. I follow a convention of 2 blank lines before and after function definitions. And, I added a bit of info regarding the function in the appropriate section in the preamble.

import hashlib
import pathlib
import argparse


def hash_file(file, f_hash):
  data = open(file, 'rb').read()
  f_hash.update(data)
  return f_hash.hexdigest()
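
For reference, here is a sketch of the chunked approach I have in mind (the preamble already reserves BUF_SIZE for a hash_block() function). The body here is my rough draft, not necessarily what will end up in the script:

```python
import hashlib

BUF_SIZE = 65536  # 64 KB, matching the module-level constant


def hash_block(file, buf_size=BUF_SIZE):
  # read and hash the file in buf_size chunks instead of one big read,
  # so memory use stays flat no matter how large the file is
  h = hashlib.md5()
  with open(file, 'rb') as f:
    while chunk := f.read(buf_size):
      h.update(chunk)
  return h.hexdigest()
```

(The `:=` walrus operator needs Python 3.8, which the base-3.8 environment covers.)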

Then I needed to add the code to traverse the appropriate directory tree and hash any files. For testing I will only do the first few and stop the traversal. There are a fair number of files. No sense processing them all until I am sure things more or less work correctly. Fortunately pathlib helps out with the powerful glob() method. And, the is_file() method helps us ignore the directory entries. So at the bottom of the previous code I added the following.

cnt = 0
for p in b_path.glob("**/*"):
  if cnt == 10:
    break
  if p.is_file():
    fl_hash = '0'

    fl_hash = hash_file(p, D_HASH)
    print(f"'{p}' => {fl_hash}")
    cnt += 1

And, voila!

(base) PS C:\Users\bark\AppData\Local\Microsoft\WindowsApps> cd r:\hugo\scripts
(base) PS R:\hugo\scripts> conda activate base-3.8
(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
'r:\hugo\tooOldCode\to2c_k_ca\404.html' => b37d419b5d7359daf88cfafb8874a8cd
'r:\hugo\tooOldCode\to2c_k_ca\index.html' => b148431b3964bf2ee92dbe6da9dff707
'r:\hugo\tooOldCode\to2c_k_ca\index.xml' => 9b2585a64dd4d3f346493a7d9f9aeb98
'r:\hugo\tooOldCode\to2c_k_ca\sitemap.xml' => 1b0fb281c9445830f9a531c91afb6ebf
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\01\why_blog\index.html' => c894268e4b8ede6658f2d01d97512fc6
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\08\dev_environ\index.html' => 349df4fd8bafe16f4683f1f76aeddd5d
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\15\init_setup_1\index.html' => dd923e5ec545b30ccabc9bb7355db0ac
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\18\init_setup_2\index.html' => 3af24b015d730abfbefd15d0f614b80a
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\22\win_cmd_line\index.html' => 9ac64ac0a9ee757c899df2ad24f002c9
'r:\hugo\tooOldCode\to2c_k_ca\2020\06\25\init_setup_3\index.html' => a78e9c6c2bfe389fb744270dc52457ac

Okay, now to write/append the data to our data file. I will start by putting the hashes into the dictionary I imported from the appropriate data file. I know it is currently empty, but in future, keys may or may not exist. For a new post or related files, the key will not exist. Otherwise there will be a key and hash in the dictionary, in which case I can check for a changed file. I add a file to my changed list if it has changed or is new, and add or update the dictionary of hashes as appropriate.

cnt = 0
for p in b_path.glob("**/*"):
  if cnt == 10:
    break
  if p.is_file():
    fl_hash = '0'

    fl_hash = hash_file(p, D_HASH)
    #print(f"'{p}' => {fl_hash}")
    tst_hash = f_hashes.get(p, 0)
    if  (tst_hash == 0) or (fl_hash != tst_hash):
      chg_files.append(p)
      f_hashes[p] = fl_hash
    cnt += 1

# as test print out dict
print(f'{f_hashes}')

And:

(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
{WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/404.html'): 'b37d419b5d7359daf88cfafb8874a8cd', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/index.html'): 'b148431b3964bf2ee92dbe6da9dff707', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/index.xml'): '9b2585a64dd4d3f346493a7d9f9aeb98', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/sitemap.xml'): '1b0fb281c9445830f9a531c91afb6ebf', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/01/why_blog/index.html'): 'c894268e4b8ede6658f2d01d97512fc6', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/08/dev_environ/index.html'): '349df4fd8bafe16f4683f1f76aeddd5d', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/15/init_setup_1/index.html'): 'dd923e5ec545b30ccabc9bb7355db0ac', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/18/init_setup_2/index.html'): '3af24b015d730abfbefd15d0f614b80a', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/22/win_cmd_line/index.html'): '9ac64ac0a9ee757c899df2ad24f002c9', WindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/25/init_setup_3/index.html'): 'a78e9c6c2bfe389fb744270dc52457ac'}

Well not quite as nice as I hoped. And, for whatever reason, I wanted to save the hashes as hex numbers, not strings. First a quick look at the pathlib documentation showed that I could use the relative_to() method to reduce some of the path info in the key. So, I tried the following changes.

    fl_path = p.relative_to(b_path)
    fl_hash = hash_file(p, D_HASH)
    #print(f"'{p}' => {fl_hash}")
    tst_hash = f_hashes.get(p, 0)

...

# changed the assignments as follows
    if  (tst_hash == 0) or (fl_hash != tst_hash):
      chg_files.append(fl_path)
      f_hashes[fl_path] = fl_hash

...

# and I am now printing the changed list as well
# as test print out dict
print(f'{f_hashes}')
print(f'\nchanged files: {chg_files}')

(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
{WindowsPath('404.html'): 'b37d419b5d7359daf88cfafb8874a8cd', WindowsPath('index.html'): 'b148431b3964bf2ee92dbe6da9dff707', WindowsPath('index.xml'): '9b2585a64dd4d3f346493a7d9f9aeb98', WindowsPath('sitemap.xml'): '1b0fb281c9445830f9a531c91afb6ebf', WindowsPath('2020/06/01/why_blog/index.html'): 'c894268e4b8ede6658f2d01d97512fc6', WindowsPath('2020/06/08/dev_environ/index.html'): '349df4fd8bafe16f4683f1f76aeddd5d', WindowsPath('2020/06/15/init_setup_1/index.html'): 'dd923e5ec545b30ccabc9bb7355db0ac', WindowsPath('2020/06/18/init_setup_2/index.html'): '3af24b015d730abfbefd15d0f614b80a', WindowsPath('2020/06/22/win_cmd_line/index.html'): '9ac64ac0a9ee757c899df2ad24f002c9', WindowsPath('2020/06/25/init_setup_3/index.html'): 'a78e9c6c2bfe389fb744270dc52457ac'}

changed files: [WindowsPath('404.html'), WindowsPath('index.html'), WindowsPath('index.xml'), WindowsPath('sitemap.xml'), WindowsPath('2020/06/01/why_blog/index.html'), WindowsPath('2020/06/08/dev_environ/index.html'), WindowsPath('2020/06/15/init_setup_1/index.html'), WindowsPath('2020/06/18/init_setup_2/index.html'), WindowsPath('2020/06/22/win_cmd_line/index.html'), WindowsPath('2020/06/25/init_setup_3/index.html')]

Still not quite what I was looking for with the file paths. More digging in the docs and on-line led me to the as_posix() method. So I changed the fl_path assignment as follows.

fl_path = p.relative_to(b_path).as_posix()

And, progress!

(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
{'404.html': 'b37d419b5d7359daf88cfafb8874a8cd', 'index.html': 'b148431b3964bf2ee92dbe6da9dff707', 'index.xml': '9b2585a64dd4d3f346493a7d9f9aeb98', 'sitemap.xml': '1b0fb281c9445830f9a531c91afb6ebf', '2020/06/01/why_blog/index.html': 'c894268e4b8ede6658f2d01d97512fc6', '2020/06/08/dev_environ/index.html': '349df4fd8bafe16f4683f1f76aeddd5d', '2020/06/15/init_setup_1/index.html': 'dd923e5ec545b30ccabc9bb7355db0ac', '2020/06/18/init_setup_2/index.html': '3af24b015d730abfbefd15d0f614b80a', '2020/06/22/win_cmd_line/index.html': '9ac64ac0a9ee757c899df2ad24f002c9', '2020/06/25/init_setup_3/index.html': 'a78e9c6c2bfe389fb744270dc52457ac'}

changed files: ['404.html', 'index.html', 'index.xml', 'sitemap.xml', '2020/06/01/why_blog/index.html', '2020/06/08/dev_environ/index.html', '2020/06/15/init_setup_1/index.html', '2020/06/18/init_setup_2/index.html', '2020/06/22/win_cmd_line/index.html', '2020/06/25/init_setup_3/index.html']
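
As an aside, the two methods can be seen in isolation with pure paths (PureWindowsPath here so the snippet behaves the same on any OS; the path is one of the files above):

```python
import pathlib

p = pathlib.PureWindowsPath('r:/hugo/tooOldCode/to2c_k_ca/2020/06/01/why_blog/index.html')
base = pathlib.PureWindowsPath('r:/hugo/tooOldCode/to2c_k_ca')
# relative_to() strips the leading base directory,
# as_posix() renders the result with forward slashes
print(p.relative_to(base).as_posix())  # 2020/06/01/why_blog/index.html
```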

Okay, now let’s get that string of hex digits into a hex number format. Again, Python is very helpful, if a little confusing. So, we convert the value returned by the hash function into a base-16 int and use the hex() function to do the rest. At this point the value is still a string. But when we write it to the module it will end up representing a hex number to Python. This is probably a design error on my part; I should likely have just left them as strings. But…

  if p.is_file():
    fl_path = p.relative_to(b_path).as_posix()

    fl_hash = '0'
    fl_hash = hash_file(p, D_HASH)
    hex_hash = hex(int(fl_hash, base=16))
    tst_hash = hex(f_hashes.get(fl_path, 0))

    if  (tst_hash == 0) or (hex_hash != tst_hash):
      chg_files.append(fl_path)
      f_hashes[fl_path] = hex_hash
    cnt += 1
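
The round trip itself is easy to check interactively, using one of the digests from above:

```python
fl_hash = 'b37d419b5d7359daf88cfafb8874a8cd'  # a hexdigest string from hash_file()
hex_hash = hex(int(fl_hash, base=16))         # parse as a base-16 int, back to a '0x...' string
print(hex_hash)  # 0xb37d419b5d7359daf88cfafb8874a8cd
```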

I also decided to print the number of changed files in brackets just before the colon. And?

(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
{'404.html': '0xb37d419b5d7359daf88cfafb8874a8cd', 'index.html': '0xb148431b3964bf2ee92dbe6da9dff707', 'index.xml': '0x9b2585a64dd4d3f346493a7d9f9aeb98', 'sitemap.xml': '0x1b0fb281c9445830f9a531c91afb6ebf', '2020/06/01/why_blog/index.html': '0xc894268e4b8ede6658f2d01d97512fc6', '2020/06/08/dev_environ/index.html': '0x349df4fd8bafe16f4683f1f76aeddd5d', '2020/06/15/init_setup_1/index.html': '0xdd923e5ec545b30ccabc9bb7355db0ac', '2020/06/18/init_setup_2/index.html': '0x3af24b015d730abfbefd15d0f614b80a', '2020/06/22/win_cmd_line/index.html': '0x9ac64ac0a9ee757c899df2ad24f002c9', '2020/06/25/init_setup_3/index.html': '0xa78e9c6c2bfe389fb744270dc52457ac'}

changed files (10): ['404.html', 'index.html', 'index.xml', 'sitemap.xml', '2020/06/01/why_blog/index.html', '2020/06/08/dev_environ/index.html', '2020/06/15/init_setup_1/index.html', '2020/06/18/init_setup_2/index.html', '2020/06/22/win_cmd_line/index.html', '2020/06/25/init_setup_3/index.html']

That looks more like what I want. Next, let’s look at writing the above data to the data file in the desired format. For the files and hashes, I want a dictionary keyed on the relative file path with the hash as a hex value. The changed file list is, well, just that: a list of the changed file paths. So, the file paths will be strings, the hash a hex encoded number. I coded the following, trying to take advantage of some of the functions (power?) Python offers. Have I mentioned how much I really love comprehensions? So succinct (at least in the right places).


if wrt_file:
  with open(file_nm, 'a') as fout:
    fout.write("\nf_hashes = {")
    for k, h in f_hashes.items():
      fout.write(f"'{k}': {h},\n")
    fout.write("}\n")
    # convert chg_files to list of quoted names, so can take advantage of Python's
    # f-strings & built-in functions and avoid writing a loop
    quote_changed = [f'"{fnm}"' for fnm in chg_files]
    fout.write(f"chg_files = [{', '.join(quote_changed)}]\n")

A quick test.

(base-3.8) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py -w y
...

And, in the file we have:

f_hashes = {}

f_hashes = {'404.html': 0xb37d419b5d7359daf88cfafb8874a8cd,
'index.html': 0xb148431b3964bf2ee92dbe6da9dff707,
'index.xml': 0x9b2585a64dd4d3f346493a7d9f9aeb98,
'sitemap.xml': 0x1b0fb281c9445830f9a531c91afb6ebf,
'2020/06/01/why_blog/index.html': 0xc894268e4b8ede6658f2d01d97512fc6,
'2020/06/08/dev_environ/index.html': 0x349df4fd8bafe16f4683f1f76aeddd5d,
'2020/06/15/init_setup_1/index.html': 0xdd923e5ec545b30ccabc9bb7355db0ac,
'2020/06/18/init_setup_2/index.html': 0x3af24b015d730abfbefd15d0f614b80a,
'2020/06/22/win_cmd_line/index.html': 0x9ac64ac0a9ee757c899df2ad24f002c9,
'2020/06/25/init_setup_3/index.html': 0xa78e9c6c2bfe389fb744270dc52457ac,
}
chg_files = ["404.html", "index.html", "index.xml", "sitemap.xml", "2020/06/01/why_blog/index.html", "2020/06/08/dev_environ/index.html", "2020/06/15/init_setup_1/index.html", "2020/06/18/init_setup_2/index.html", "2020/06/22/win_cmd_line/index.html", "2020/06/25/init_setup_3/index.html"]

Seems to work. Now let’s test and make sure we end up with no changed files on the next execution (still limiting to 10 files).

(base) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
{'404.html': 238582179018727778485455582391960905933, 'index.html': 235648561706110890619412447387505915655, 'index.xml': 206225165066867650899345358870936218520, 'sitemap.xml': 35970660890702275721772960941216591551, '2020/06/01/why_blog/index.html': 266614841097394438315118637913574813638, '2020/06/08/dev_environ/index.html': 69940015383505043413866042714007395677, '2020/06/15/init_setup_1/index.html': 294518727427708003845849304314037514412, '2020/06/18/init_setup_2/index.html': 78353280884160168447467223790334031882, '2020/06/22/win_cmd_line/index.html': 205730702291576059079066223639937417929, '2020/06/25/init_setup_3/index.html': 222721554076147807478582730552458368940}

changed files (0): []

Well that seems to have worked. Okay, tidy up the data file. Build the complete list of file hashes (i.e. remove the break on count of 10 files processed). Test. Takes a bit of time. Since this is for my purposes, didn’t bother adding a spinner or some such. Compile with new post and test again.

(base) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py -w y
...
changed files (473): ['404.html', 'index.html', 'index.xml', 'sitemap.xml', '2020/06/01/why_blog/index.html', ...

Seems to have worked. And again without the write.

(base) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
...
changed files (0): []

Also seems to work. Now, compile a post and see what we get. Testing, so won’t write to file just yet.

(base) PS R:\hugo\scripts> python.exe R:\hugo\scripts\hugo_changed_hash.py
...
changed files (472): ['index.html', 'index.xml', 'sitemap.xml', '2020/06/01/why_blog/index.html', ...

Something is definitely wrong. Netlify only found 31 changed files. In my list, the only file with the same hash is apparently the first one (404.html). Can you see or figure out what I did wrong? To make it easier, perhaps, I have put all my current code in a side post.

Done for Today

I was going to go through to the end and correct my bug. But it took me several days to find and understand my mistake. While that was happening, I wrote a second script using file sizes to look for changed files. I figured hashes were the better way, but for the life of me I just couldn’t see what I was doing wrong. Eventually it came to me in the middle of the night. By then I’d written the file size version. Which, by the way, worked reasonably well.

A bunch of googling, reading and re-reading the documentation, and it finally twigged. I expect you’ll see it a lot faster than I did. In fact, everything necessary to understand my problem is in the hashlib documentation (see link in Resources section).

Until next time.

Resources