We currently have a lot going on in the population_by_age.py module. It would likely be best to move some of that, e.g. getting data from the population CSV and plotting charts, into separate modules and/or packages. You know, separation of concerns. For now, all I want to leave in the population_by_age.py module is the code related to the user interface, i.e. displaying menu choices, getting the data used to display a chart, etc.

And, we haven’t yet done anything with the rc_names.py module we wrote a bit earlier, i.e. before our side trip to restructure the project layout, which we are currently still on.

Before doing any of that however, I’d like to get the existing module to function as is. We don’t really want to start refactoring broken code.

First, Fix Existing Code

The first attempt to run the code failed because it could not find the CSV file with the population data. So, I decided to add a DATA_DIR variable. Since I run the code from within VS Code and my project opens in the py_play folder, the current working directory (cwd) is always ...\py_play. Relative to that cwd, the data dir is at .\data. So, I added DATA_DIR = "./data" and ran the code.

When I run the app in VS Code everything seems to work fine. But if I open a terminal window, go to the py_play\population directory and run the app there, things fail.

PS C:\Users\bark> cd r:\learn\py_play\population
PS R:\learn\py_play\population> python population_by_age.py
Please provide the country and year for which you wish a plot, or 'q' to quit.
Country: Belgium
4-digit year (1950-2100): 2002
Attempting to plot data for Belgium in the year 2002.
Traceback (most recent call last):
  File "population_by_age.py", line 126, in <module>
    process_plot(country, year, csv_nm)
  File "population_by_age.py", line 91, in process_plot
    lines = get_file_lines(country, year, fl_nm)
  File "population_by_age.py", line 60, in get_file_lines
    csv_fl = open(DATA_DIR / csv_nm, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data\\WPP2019_PopulationByAgeSex_Medium.csv'
PS R:\learn\py_play\population>

How likely is this scenario? Don’t know. But given I have population_by_age.py in py_play\population it seems likely to me that someone might do just this. So what can we do about it?

Well, I have an approach, but is it the best one? Don’t know. But for now it will work and it also deals with the file separator issue. I am going to use the pathlib module. But, you need to be running Python 3.4 or better. Specifically I will be using the Path class from the module.

This module offers classes representing filesystem paths with semantics appropriate for different operating systems.

pathlib — Object-oriented filesystem paths

I am going to check which directory (of the two most likely ones) we are in and set the DATA_DIR variable accordingly. If we aren’t somewhere I expect, I’ll print a message and exit the app. First things first, make sure you import the Path class at the top of your file, from pathlib import Path. Then I added the following above the definition for the csv_nm variable.

from pathlib import Path

# Test for the two most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
p = Path.cwd()
if (p.is_dir() and p.name == 'py_play'):
  DATA_DIR = Path('./data')
elif (p.is_dir() and p.name == 'population'):
  DATA_DIR = Path('../data')
else:
  print(f"\nCurrent working directory is an unexpected location, {p}!")
  print("Please run the application from the root directory. Quitting application")
  exit(1)

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'

Oh, yes. I also had to modify the filename parameter in the open() expression in the get_file_lines() function to read: csv_fl = open(DATA_DIR / csv_nm, 'r'). That filename expression works because of the Path class. It is also rendered as appropriate for each operating system on which the code is run.

OK, for better or worse, that works. Though I am thinking I will add a run_app.py in the root directory at some point. That should make setting the various application paths a little easier. Who knows, the project layout may undergo more changes.
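
One possible alternative to checking the name of the current working directory is to search upward from a starting folder until a data folder turns up. The following is only a minimal sketch of that idea, not what the app currently does. The function name find_data_dir and its parameters are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def find_data_dir(start: Path, dir_name: str = "data") -> Optional[Path]:
    """Return the nearest `dir_name` folder in `start` or one of its parents.

    Hypothetical helper: returns None when no such folder is found.
    """
    start = start.resolve()
    # Check the starting folder first, then each parent up to the drive root.
    for folder in (start, *start.parents):
        candidate = folder / dir_name
        if candidate.is_dir():
            return candidate
    return None
```

In population_by_age.py the starting point could be Path(__file__).parent rather than Path.cwd(). That would make the result depend on where the module lives instead of where the app was started.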

We are now in a position to look at refactoring the current code into suitable modules/packages.

Refactor Code Into Multiple Focused Modules/Packages

Chart Module

Let’s start with the charting code. Only one function and a couple of imports need to be moved. I moved the function plot_bar_chart() from py_play\population\population_by_age.py to py_play\population\chart\chart.py. Then I moved the imports for pyplot and numpy to the same module. In population_by_age.py I added import chart and modified the call to the plotting function in process_plot() to include the package name, chart.plot_bar_chart(...). I figured we’d be good to go. No such luck.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population/population_by_age.py

Please provide the country and year for which you wish a plot, or 'q' to quit.
Country: peru
4-digit year (1950-2100): 2002
Attempting to plot data for peru in the year 2002.
Traceback (most recent call last):
  File "r:/learn/py_play/population/population_by_age.py", line 107, in <module>
    process_plot(country, year, csv_nm)
  File "r:/learn/py_play/population/population_by_age.py", line 76, in process_plot
    chart.plot_bar_chart(proper_c_nm, year, pop_data)
AttributeError: module 'chart' has no attribute 'plot_bar_chart'

I then tried a variety of import statements. Googled and googled. Read and read. I still don’t fully understand how imports in my project layout really work. But I did find a solution. It all comes down to what Python sees of the file structure when it is run. On my system this is what the Python path variable contains:

sys.path: ['r:\\learn\\py_play\\population', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\python38.zip', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\DLLs', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\lib', 'E:\\appDev\\Miniconda3\\envs\\base-3.8', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\lib\\site-packages']

The key item is that first value: r:\\learn\\py_play\\population. Because the first thing we execute is a module in that directory, Python puts that directory first in its search path. Everything else in the list is added on start up by Python or conda. Essentially it means Python will look for imports starting in the population directory. If it doesn’t find what it is looking for there, it will continue looking in the other entries in the path variable. A plain import chart finds the chart folder and imports it as a package, and the package itself has no plot_bar_chart attribute, which is exactly what the error says. So, we need to tell Python to look in the chart folder and load the chart.py module. The following finally worked: from chart import chart.
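
To see the difference between importing the folder and importing the module inside it, here is a small self-contained sketch. It builds a throwaway layout in a temporary directory, so the package name demo_chart and the function compare_imports are purely hypothetical stand-ins for the chart folder and chart.py in my project.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def compare_imports():
    """Return (package_has_function, result_of_calling_it_via_the_module)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: <tmp>/demo_chart/chart.py
        pkg_dir = Path(tmp) / "demo_chart"
        pkg_dir.mkdir()
        (pkg_dir / "chart.py").write_text(
            "def plot_bar_chart():\n    return 'plotted'\n"
        )
        # Python puts the script's folder first in sys.path; imitate that here.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # Like `import chart`: this binds the folder (a package), not chart.py.
            package = importlib.import_module("demo_chart")
            package_has_function = hasattr(package, "plot_bar_chart")
            # Like `from chart import chart`: this loads the module in the folder.
            module = importlib.import_module("demo_chart.chart")
            result = module.plot_bar_chart()
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_chart.chart", None)
            sys.modules.pop("demo_chart", None)
    return package_has_function, result
```

Calling compare_imports() shows that the package on its own does not expose the function, while the module inside it does.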

And, bingo, the app once again displays charts. Hopefully with the correct data. I haven’t yet really checked that very carefully.

One issue though is that VS Code keeps telling me there is an error in population_by_age.py. Specifically:

  {
    "resource": "/r:/learn/py_play/population/population_by_age.py",
    "owner": "python",
    "code": "import-error",
    "severity": 8,
    "message": "Unable to import 'chart'",
    "source": "pylint",
    "startLineNumber": 4,
    "startColumn": 1,
    "endLineNumber": 4,
    "endColumn": 1
  }

I am guessing this is because VS Code is starting the application from the py_play directory. And the path to the chart package is not correct if working from that directory. Don’t know if there is a way I can tell VS Code to ignore the error. I committed my changes.

Oh yes, while doing all that troubleshooting, I ended up with a bunch more __pycache__ folders. Git wasn’t ignoring them. So I had to modify my .gitignore. I changed __pycache__/* to a plain __pycache__ and added *.pyc for good measure. That worked. I committed this change as well (separately, of course).

Fix a Bit of Bad Logic

Earlier I added the code to generate a path for the data folder, DATA_DIR. I then used that variable in the function get_file_lines() to open the CSV file: csv_fl = open(DATA_DIR / csv_nm, 'r'). But get_file_lines() already takes a file_nm parameter. And this function is currently only called from process_plot(), which also takes a file name parameter. I should be setting the CSV file name in the call to process_plot() and letting it pass that on to the functions it calls. I figured I could also give that parameter a default value. So, I changed the definition to read: def process_plot(country, year, fl_nm=DATA_DIR / csv_nm):. And I changed the open() call in get_file_lines() to read: csv_fl = open(file_nm, 'r'). Then I committed my changes. Now we can carry on.

Population Database Module/Package

Okay, moved the following from population_by_age.py to database/population.py:

  • import csv
  • definitions for the CSV index value constants, e.g. X_NM
  • definitions for the functions: get_pop_data(), get_file_lines()

Then I updated population_by_age.py:

  • added from database import population as pdb, using alias as don’t want to type full name each time
  • modified process_plot() to use the package label where appropriate (function calls and index variable)

A simple test says it works. Though VS Code is still complaining about the imports from my application.

population_by_age.py

import argparse
from pathlib import Path
from chart import chart
from database import population as pdb

# Test for the two most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
p = Path.cwd()
if p.name == 'py_play':
  DATA_DIR = Path('./data')
elif p.name == 'population':
  DATA_DIR = Path('../data')
else:
  print(f"\nCurrent working directory is an unexpected location, {p}!")
  print("Please run the application from the root directory. Quitting application")
  exit(1)

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'


def process_plot(country, year, fl_nm=DATA_DIR / csv_nm):
  lines = pdb.get_file_lines(country, year, fl_nm)
  if lines:
    proper_c_nm = lines[0][pdb.X_NM]
    pop_data = pdb.get_pop_data(lines)
    chart.plot_bar_chart(proper_c_nm, year, pop_data)
  else:
    print(f"\n! No data found for '{country}' and '{year}'.\n")


parser = argparse.ArgumentParser()
# long name preceded by --, short by single -
parser.add_argument('--country', '-c', help='Country name')
# even though we need to pass it to other functions as string, set type to int, as a means of error checking
# and we know that we currently only have data for the years 1950 to 2100
# parser.add_argument('--year', '-y', type=int, help='4-digit year', choices=range(1950, 2101))
# didn't like the help/error output, so leaving choices argument out
parser.add_argument('--year', '-y', type=int, help='4-digit year')

args = parser.parse_args()
if args.country and args.year and 1950 <= args.year <= 2100:
  # rely on the default fl_nm (DATA_DIR / csv_nm); passing the bare csv_nm would drop the data directory
  process_plot(args.country, str(args.year))
else:
  while True:
    print("Please provide the country and year for which you wish a plot, or 'q' to quit.")
    country = input("Country: ")
    if country.lower() == 'q':
      break
    year = input("4-digit year (1950-2100): ")
    if year.lower() == 'q':
      break
    # compare as a number; comparing strings lets values such as '20' slip through
    if not year.isdecimal() or not 1950 <= int(year) <= 2100:
      print('Year must be in the range 1950 - 2100 inclusive!')
      continue
    print(f'Attempting to plot data for {country} in the year {year}.')
    process_plot(country, year)
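
About that commented-out choices argument: one way to keep the range check inside argparse without the long list of choices in the help and error output is a small type function. This is only a sketch under my own assumptions. The names year_in_range, build_parser, MIN_YEAR and MAX_YEAR are invented here and are not part of the app.

```python
import argparse

MIN_YEAR, MAX_YEAR = 1950, 2100  # range of years covered by the CSV data


def year_in_range(text: str) -> int:
    """Convert a command line value to an int and check it is a supported year."""
    try:
        year = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not a whole number")
    if not MIN_YEAR <= year <= MAX_YEAR:
        raise argparse.ArgumentTypeError(
            f"year must be between {MIN_YEAR} and {MAX_YEAR}, got {year}"
        )
    return year


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same two options as population_by_age.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--country', '-c', help='Country name')
    parser.add_argument('--year', '-y', type=year_in_range,
                        help=f'4-digit year ({MIN_YEAR}-{MAX_YEAR})')
    return parser
```

With this in place, argparse itself reports a bad year and exits, so the range test after parse_args() would no longer be needed.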

chart.py

import matplotlib.pyplot as plt
import numpy as np

def plot_bar_chart(country, year, pop_data):
  # define the x-labels for the chart (the age groups)
  x_labels = list(pop_data.keys())
  # get the y-value (bar height) for each x-label
  y_values = list(pop_data.values())
  # figure out where to put each bar along the x-axis, nice of numpy to help
  x_pos = np.arange(len(x_labels))

  # because of the x-label sizes, we need a largish display
  plt.figure(figsize=(15,7.5))
  # give matplotlib.pyplot the values it needs to draw the bars
  plt.bar(x_pos, y_values, align='center', alpha=0.5)
  # tell it what the x-labels are and where to put them
  plt.xticks(x_pos, x_labels)
  # add some info regarding the axes and give the chart a title.
  plt.xlabel('Age Group')
  plt.ylabel('Population (1000s)')
  plt.title(country + ' ' + year + ': Population by Age Group')

  # generate the plot
  plt.show()

population.py

import csv

# indices to certain fields in the CSV file data
X_NM = 1
X_YR = 4
X_AGE_GRP = 6
X_TOT_POP = 11


def get_pop_data(src_data):
  age_group_data = {}
  
  for row in src_data:
    # save the stuff we want to our data dictionary
    age_group_data[row[X_AGE_GRP]] = float(row[X_TOT_POP])

  return age_group_data


def get_file_lines(country, year, file_nm):
  lines = []
  found_pair = False
  # the with statement closes the file for us, even if an error occurs while reading
  with open(file_nm, 'r', newline='') as csv_fl:
    # csv.reader returns a reader object that iterates over the rows of the csv file
    r = csv.reader(csv_fl, delimiter=',', quotechar='"')
    for row in r:
      # don't forget to account for user typing preferences
      found = country.lower() == row[X_NM].lower() and year == row[X_YR]
      if found:
        found_pair = True
        lines.append(row)
      # assumes rows for a country/year pair are grouped together, so stop once past them
      if found_pair and not found:
        break
  return lines
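
To check the filtering logic without the full CSV file, the same idea can be run against a few rows held in memory. This is a rough sketch only. The rows below are made up for illustration (they are not real population figures), and matching_rows is a hypothetical stand-in for get_file_lines() that takes an open file object instead of a file name.

```python
import csv
import io

X_NM = 1  # column holding the country name
X_YR = 4  # column holding the year

# Made-up rows for illustration only; the zeros are placeholder fields.
SAMPLE_CSV = """\
0,Belgium,0,0,2001,0,0-4
0,Belgium,0,0,2002,0,0-4
0,Belgium,0,0,2002,0,5-9
0,Belgium,0,0,2003,0,0-4
0,Peru,0,0,2002,0,0-4
"""


def matching_rows(csv_file, country, year):
    """Collect the block of rows matching country and year, then stop reading."""
    rows = []
    for row in csv.reader(csv_file):
        found = country.lower() == row[X_NM].lower() and year == row[X_YR]
        if found:
            rows.append(row)
        elif rows:
            # We already collected the block and have now moved past it.
            break
    return rows
```

Passing io.StringIO(SAMPLE_CSV) as csv_file with "belgium" and "2002" should return just the two Belgium rows for 2002, regardless of how the country name is capitalised.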

Don’t forget to commit all these changes to version control. And that’s it for this one.