Generating File Hashes

Okay, my script, hugo_changed_hash.py, just plain don't work! That is, it has at least one truly serious bug. As I explained in the preceding post, it took me quite some time to sort it out in my head. Nothing I did with debug prints and such was of any immediately obvious value. Eventually, one dream-filled night, I woke around 02:00 and had the answer. It was difficult to get back to sleep — I really wanted to go crank up the computer and check out my solution. But, one must get a good night's rest. Eventually I went back to sleep until a more reasonable hour.

Now, I guess I could just tell you what I figured out and end this post. But, I also mentioned in the last post that I would write a function to hash large/larger files one block at a time. That is, read the file a portion (a block) at a time and update the hash with each block as it is read. Let's leave that until later.

I also ended up writing a version of the script that used file sizes rather than hashes to try to determine which files had changed. It tided me over until I could fix my bug. So, let's start with that.

Using File Sizes Instead of Hashes

Okay, while struggling to sort out my file hashing issue, I copied the pertinent parts of the hash script to a new file and modified the documentation and the code to search for changed files based on the current file sizes. Well, the current sizes compared against the sizes saved after the previous Hugo compile of the whole blog site. So, at least two new files, hugo_changed_len.py and to2c_k_sizes.py. The former is the script (len = length), the latter the data file. The design of the data file and its data structures is very much the same as that for the hashes, except that I store an integer giving each file's last length in bytes rather than a hexadecimal hash value. It is considerably shorter. The important part of the changed code looks something like the following.

...

args = parser.parse_args()
if args.wrt_file == 'y':
  wrt_file = True
if args.file_nm:
  file_nm = args.file_nm

b_path = B_DIRS[b_lbl]
chg_files = []
# p_max_size = 0
# p_max_nm = ''
#for p in b_path.rglob("*.*ml"):
for p in b_path.glob("**/*"):
  if p.is_file():
    f_size = p.stat().st_size
    f_path = p.relative_to(b_path).as_posix()
    tst_size = f_sizes.get(f_path, 0)
    # flag files that are new (no saved size) or whose size has changed
    if (tst_size == 0) or (f_size != tst_size):
      chg_files.append(f_path)
      f_sizes[f_path] = f_size

# convert chg_files to list of quoted names
quote_changed = [f'"{fnm}"' for fnm in chg_files]
print(f"changed files ({len(chg_files)}): {chg_files}")

if wrt_file:

...
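For context, the data module itself is not shown above. Based on how f_sizes is used in the loop, my guess is that it amounts to a plain dict keyed by the POSIX-style relative path, which the if wrt_file: branch would then rewrite with the updated values. A rough sketch only; the paths and sizes below are invented, and the write code is not necessarily what is in the script:

# to2c_k_sizes.py -- guessed shape of the data module, example values only
f_sizes = {
  "404.html": 9254,
  "index.html": 30117,
  "index.xml": 45962,
}

# and the wrt_file branch would rewrite it along these lines (illustrative only)
with open("to2c_k_sizes.py", "w") as fout:
  fout.write(f"f_sizes = {f_sizes!r}\n")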

And, it worked pretty much the way I wanted. I never did check or record the number of changes it reported versus the number Netlify reported. But, as I recall, they were similar. Or, at least for me, close enough.

Hashing More Efficiently

In my reading, someone convinced me that for larger files I should really hash them a block at a time. So I had also written a second function, hash_block(), during my development. It was not covered in the last post, because I wanted that post to focus on the bug I had in my code.

And, no sense covering it now, as we still haven’t sorted the bug. Well, at least not yet in this post. And, I expect you already know what I was doing incorrectly.

The writing of this function, hash_block(), should really have told me what I was doing wrong. But, braindead is braindead. Though it probably was a part of the things my dreaming mind used to sort things out. Got to love what a brain can do while you are sleeping. When I was younger this kind of thing used to happen during my morning shower. But, my brain was very much faster back then.

Okay, This Is My Mistake

In my code I was initializing the variable D_HASH = hashlib.md5() and passing D_HASH to the individual file hashing function, hash_file(). A couple of key bits from the hashlib documentation: each constructor will "return a hash object", and at any point you can "ask it for the digest of the concatenation of the data fed to it so far".

In my statement initializing D_HASH, I am instantiating such an object, which I then pass to each invocation of hash_file(). That one object is used for the whole of the script's traversal of the blog's file structure. So, the data for each file keeps getting fed into, and hashed by, that single object. After each file is processed by my function, what I really have is a hash value for the concatenation of that file and all the files before it.

Which was fine when comparing a set of unchanged blog files. But after compiling with a new post, the first file, 404.html, was unchanged, so its current and previous hashes matched. The next file, index.html, had changed. At that point the accumulated hash value for the first two files was different than it would have been for the previous compilation. And, as a consequence, so was the hash for every file thereafter. That's exactly what my script was telling me. So, the script was basically correct. My use of hashlib was not!
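A tiny standalone example, not from the script, shows the behaviour; the byte strings are just stand-ins for file contents:

import hashlib

a = b"contents of the first file"
b = b"contents of the second file"

# one shared object: the second digest covers a + b, not just b
shared = hashlib.md5()
shared.update(a)
print(shared.hexdigest())           # digest of a
shared.update(b)
print(shared.hexdigest())           # digest of a + b, which was my bug

# a fresh object per file gives a per-file digest
print(hashlib.md5(b).hexdigest())   # digest of b alone, a different value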

The fix is pretty straightforward. Set D_HASH to hold the constructor itself, without actually instantiating an object. Then modify my file processing function to instantiate a new hash object each time it is called.

...

D_HASH = hashlib.md5  # the constructor itself, not an instance

...

def hash_file(file, f_hash):
  # instantiate a fresh hash object for this file only
  hf = f_hash()
  with open(file, 'rb') as f:
    data = f.read()
  hf.update(data)
  return hf.hexdigest()

And, well, there you go. After first tidying up my to2c_k_hashes.py data file, I compiled a new post — the last one, in fact. Netlify found 56 changed files, as did my script. Perfect!

PS R:\hugo\tooOldCode> netlify deploy --prod
(node:24596) ExperimentalWarning: queueMicrotask() is experimental.
Deploy path:        R:\hugo\tooOldCode\ndocs
Configuration path: R:\hugo\tooOldCode\netlify.toml
Deploying to main site URL...
√ Finished hashing 490 files
√ CDN requesting 56 files
√ Finished uploading 56 assets
√ Deploy is live!
changed files (56): ['index.html', 'index.xml', 'sitemap.xml', '2021/02/22/automate/index.html', '2021/02/22/code_hash_script/index.html', 'page/2/index.html', 'page/3/index.html', 'page/4/index.html', 'page/5/index.html', 'page/6/index.html', 'page/7/index.html', 'post/index.html', 'post/index.xml', 'post/page/2/index.html', 'post/page/3/index.html', 'post/page/4/index.html', 'post/page/5/index.html', 'post/page/6/index.html', 'post/page/7/index.html', 'tags/index.html', 'tags/index.xml', 'tags/animation/index.html', 'tags/animation/index.xml', 'tags/automate/index.html', 'tags/automate/index.xml', 'tags/automate/page/1/index.html', 'tags/directory-traversal/index.html', 'tags/directory-traversal/index.xml', 'tags/directory-traversal/page/1/index.html', 'tags/file-handling/index.html', 'tags/file-handling/index.xml', 'tags/file-handling/page/1/index.html', 'tags/hash/index.html', 'tags/hash/index.xml', 'tags/histogram/index.html', 'tags/histogram/index.xml', 'tags/hugo/index.html', 'tags/hugo/index.xml', 'tags/hugo/page/1/index.html', 'tags/matplotlib/index.html', 'tags/matplotlib/index.xml', 'tags/page/2/index.html', 'tags/page/3/index.html', 'tags/page/4/index.html', 'tags/page/5/index.html', 'tags/page/6/index.html', 'tags/page/7/index.html', 'tags/page/8/index.html', 'tags/page/9/index.html', 'tags/sampling/index.html', 'tags/sampling/index.xml', 'tags/sampling/page/2/index.html', 'tags/shape/index.html', 'tags/shape/index.xml', 'tags/shape/page/2/index.html', 'tags/shape/page/3/index.html']

I don't recall if I explained why I was using D_HASH. I had originally thought I might use a different hash function once I got things working and wanted to allow for that eventuality from the start. Maybe even allow switching it via a command line parameter. In the end I decided that would likely not happen, but saw no sense in eliminating the variable.
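Had I gone the command line route, it might have looked something like the following. Just a sketch of the idea, not code from my script; the --hash flag and its choices are my own invention here:

import argparse
import hashlib

parser = argparse.ArgumentParser()
parser.add_argument("--hash", choices=["md5", "sha1", "sha256"], default="md5")
args = parser.parse_args()

# store the chosen constructor, not an instance, same lesson as above
D_HASH = getattr(hashlib, args.hash)
print(D_HASH(b"some bytes").hexdigest())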

Hashing More Efficiently (continued)

Okay, back to the function to hash files in blocks rather than in one big gulp.

You may recall that in the code shown in the last post I had defined BUF_SIZE = 65536. That is the block size recommended in the referenced Stack Overflow post. Basically, I check the size of the file, and if it is greater than BUF_SIZE, I use the function hash_block() rather than hash_file(). Something like:

...

def hash_block(file, f_hash):
  hf = f_hash()
  with open(file, 'rb') as f:
    # read and hash the file one BUF_SIZE block at a time
    while True:
      data = f.read(BUF_SIZE)
      if not data:
        break
      hf.update(data)

  return hf.hexdigest()

...

for p in b_path.glob("**/*"):
  if p.is_file():
    fl_hash = '0'
    t_size = p.stat().st_size
    # block-wise hashing for files larger than one buffer, single read otherwise
    if t_size > BUF_SIZE:
      fl_hash = hash_block(p, h_func)
    else:
      fl_hash = hash_file(p, h_func)

...

I am sure it could perhaps be done a bit more efficiently, but…
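For what it's worth, a single function that always reads in blocks would handle both cases, since a small file simply comes back in one read. And on Python 3.11+, hashlib.file_digest() will do the chunked reading for you. A rough sketch, not what is currently in my script:

import hashlib

BUF_SIZE = 65536

def hash_any(file, f_hash=hashlib.md5):
  # hash a file of any size, one BUF_SIZE block at a time
  hf = f_hash()
  with open(file, 'rb') as f:
    # iter() keeps calling f.read(BUF_SIZE) until it returns b'' at end of file
    for block in iter(lambda: f.read(BUF_SIZE), b''):
      hf.update(block)
  return hf.hexdigest()

# Python 3.11+ alternative: let hashlib do the block reading
# with open(file, 'rb') as f:
#   fl_hash = hashlib.file_digest(f, 'md5').hexdigest()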

That’s It For This One

I hope there was something of value for you in these last two posts. If nothing else, "Read the documentation. Then read it again!"

I have also added a post with the final code for both scripts as they currently stand.

Not sure where I will go next. But, hopefully, something will come up.

I am still rather caught up in working on the online Harvard CS50 course. Currently on Week 5, trying to sort out building a hash table for a dictionary of words — in C. I also need to write code to search the table for a given word. And a few other things, like freeing memory when done or when trouble arises. I am finding it rather entertaining. Next week the course moves on to coding in Python rather than in C.

But those weeks working with C certainly brought to mind just how much Python does for us. Not to mention all the libraries providing so much power for so relatively little.

If you have the time, I do recommend taking the course. For me it has proven to be a really great refresher. And it has also reminded me of all the things a high-level language lets us gloss over with nary a thought. An experience which made coding in C all the more challenging for me.

Have a great week.

Resources