hugo_changed_hash.py

For what it’s worth, here you go: the script referenced in Automate: Utility #1.

Have you found my bug?

"""
Script: r:/hugo/scripts/hugo_changed_hash.py

Script to determine which compiled Hugo files have probably changed
after a compile, based on file hash values. I currently check my
personal site output directory. I was only going to check the netlify
one, but there might be situations where that would not be sufficient.

Going to use the MD5 algorithm as I am not really worried about
security here, just looking for possibly changed files.

Hashes are saved in/written to a python module (need to manually
create file with empty dictionary for the first run) that is then
imported by this one. There may be more than one such file if I start
generating multiple blogs.
  - TooOldToCode: r:/hugo/scripts/to2c_k_hashes.py
The '_k_' indicates the data is for the files in the personal site
directory. An '_n_' will indicate it is for the 'netlify' files.
They will be different as I don't minify the ones on the personal site.
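
For example, after a run the data module might look something like
this (the hash value is made up for illustration):

  f_hashes = {
    'css/main.css': 0x1bc29b36f623ba82aaf6724fd3b16718,
  }
  chg_files = ["css/main.css"]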

Arguments
----

-w [y|n] - if 'y', write the updated hashes and the list of changed
           files to the data module; defaults to 'n'

Module/global variables:
----

B_DIRS    - dictionary of starting directory by blog
BUF_SIZE  - buffer size for reading files in chunks (not used by
            hash_file(), which hashes each file as one block)
D_HASH    - default hash object; hash_file() copies it for each file
DEF_BLOG  - blog to check (hard-coded for now, no command line
            override yet)
wrt_file  - write new hashes/changed files to the appropriate data
            file; defaults to False, i.e. no write
chg_files - list of files that are new or whose hash differs from
            the value previously stored in the data file

Functions
----

hash_file(file, f_hash): hash 'file' as one block using a copy of the 'f_hash' hash object

"""
import hashlib
import pathlib
import argparse


def hash_file(file, f_hash):
  # copy() gives each file a fresh hash; updating the shared D_HASH
  # object directly would accumulate state across every file hashed
  h = f_hash.copy()
  with open(file, 'rb') as fin:
    h.update(fin.read())
  return h.hexdigest()


# the compiled files for netlify are in 'ndocs'
# those for my personal site are in 'to2c_k_ca'
# Immediately convert the file paths into a pathlib path for
# later use when traversing the blog directory tree.
B_DIRS = {
  'to2c_n': pathlib.Path('r:/hugo/tooOldCode/ndocs'),
  'to2c_k': pathlib.Path('r:/hugo/tooOldCode/to2c_k_ca')
}

# An arbitrary (but fixed) buffer size for chunked reads
# 65536 bytes = 64 kilobytes
BUF_SIZE = 65536
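
# hash_file() reads each file as one block; a chunked variant, easier
# on memory for very large files, might look like this (a sketch, not
# used below):
def hash_file_chunked(file, f_hash, buf_size=BUF_SIZE):
  h = f_hash.copy()
  with open(file, 'rb') as fin:
    # read and hash buf_size bytes at a time until EOF
    for block in iter(lambda: fin.read(buf_size), b''):
      h.update(block)
  return h.hexdigest()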

# for my purposes md5 is plenty good enough, security isn't the issue.
# D_HASH is a template hash object; hash_file() copies it for each
# file rather than updating it directly, which would accumulate
# state across every file hashed
D_HASH = hashlib.md5()
DEF_BLOG = 'to2c_k'

parser = argparse.ArgumentParser(description='Check Hugo compile for changed files')
parser.add_argument('--wrt_file', '-w', help='Write hashes to data file?',
                    choices=['n', 'y'], default='n')
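
# DEF_BLOG is hard-coded for now; a blog selector could be added
# along these lines (a sketch, not wired up below):
#   parser.add_argument('--blog', '-b', choices=list(B_DIRS),
#                       default=DEF_BLOG, help='Which blog to check')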

f_hashes = None
if DEF_BLOG == 'to2c_n':
  from to2c_n_hashes import f_hashes
elif DEF_BLOG == 'to2c_k':
  from to2c_k_hashes import f_hashes
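
# the if/elif above could also be a dynamic import, which would scale
# better if I start generating more blogs (a sketch, not used here):
#   import importlib
#   f_hashes = importlib.import_module(f"{DEF_BLOG}_hashes").f_hashes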

# default hash data file name
file_nm = f"{DEF_BLOG}_hashes.py"
wrt_file = False
chg_files = []

args = parser.parse_args()
if args.wrt_file == 'y':
  wrt_file = True
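
# an alternative: a boolean flag via argparse's store_true action
# would collapse the y/n handling above (a sketch, not used here):
#   parser.add_argument('--wrt_file', '-w', action='store_true')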

b_path = B_DIRS[DEF_BLOG]

for p in b_path.glob("**/*"):
  if p.is_file():
    fl_path = p.relative_to(b_path).as_posix()

    fl_hash = hash_file(p, D_HASH)
    # round-trip through int so the value matches the 0x-prefixed
    # int literals stored in the data module
    hex_hash = hex(int(fl_hash, base=16))
    tst_hash = hex(f_hashes.get(fl_path, 0))

    # a new file gets tst_hash '0x0', so it lands in chg_files too
    if hex_hash != tst_hash:
      chg_files.append(fl_path)
      f_hashes[fl_path] = hex_hash

# as a test, print out the dict
print(f_hashes)
print(f'\nchanged files ({len(chg_files)} of {len(f_hashes)}): {chg_files}')

# if requested via cmd ln parameter write to hash data file
# ('a' appends a fresh f_hashes definition; on import the last
# assignment in the module wins)
if wrt_file:
  with open(file_nm, 'a') as fout:
    fout.write("\nf_hashes = {\n")
    for k, h in f_hashes.items():
      fout.write(f"  '{k}': {h},\n")
    fout.write("}\n")
    # convert chg_files to a list of quoted names so a single
    # f-string/join can write the whole list without another loop
    quote_changed = [f'"{fnm}"' for fnm in chg_files]
    fout.write(f"chg_files = [{', '.join(quote_changed)}]\n")