We currently have a lot going on in the population_by_age.py module. It would likely be best to move some of that into separate modules and/or packages, e.g. getting data from the population CSV and plotting charts. You know, separation of concerns. For now, all I want to leave in the population_by_age.py module is the code related to the user interface, i.e. displaying menu choices, getting the data used to display a chart, etc.
And we haven’t yet done anything with the rc_names.py module we wrote a bit earlier, i.e. before our side trip to restructure the project layout, which we are still on.
Before doing any of that, however, I’d like to get the existing module to function as is. We don’t really want to start refactoring broken code.
First, Fix Existing Code
The first attempt to run the code failed because it could not find the CSV file with the population data. So, I decided to add a DATA_DIR variable. Since I run the code from within VS Code and my project opens in the py_play folder, the current working directory (cwd) is always ...\py_play. Relative to that cwd, the data dir is at .\data. So, I added DATA_DIR = "./data" and ran the code.
When I run the app in VS Code everything seems to work fine. But if I use a terminal window, go to the py_play\population directory and run the app, things fail.
PS C:\Users\bark> cd r:\learn\py_play\population
PS R:\learn\py_play\population> python population_by_age.py
Please provide the country and year for which you wish a plot, or 'q' to quit.
Country: Belgium
4-digit year (1950-2100): 2002
Attempting to plot data for Belgium in the year 2002.
Traceback (most recent call last):
  File "population_by_age.py", line 126, in <module>
    process_plot(country, year, csv_nm)
  File "population_by_age.py", line 91, in process_plot
    lines = get_file_lines(country, year, fl_nm)
  File "population_by_age.py", line 60, in get_file_lines
    csv_fl = open(DATA_DIR / csv_nm, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data\\WPP2019_PopulationByAgeSex_Medium.csv'
PS R:\learn\py_play\population>
How likely is this scenario? Don’t know. But given I have population_by_age.py in py_play\population, it seems likely to me that someone might do just this. So what can we do about it?
Well, I have an approach, but is it the best one? Don’t know. But for now it will work, and it also deals with the file separator issue. I am going to use the pathlib module, specifically the Path class. Note that pathlib requires Python 3.4 or later.
This module offers classes representing filesystem paths with semantics appropriate for different operating systems.
pathlib — Object-oriented filesystem paths
I am going to check which directory (of the two most likely ones) we are in and set the DATA_DIR variable accordingly. If we aren’t somewhere I expect, I’ll print a message and exit the app. First things first, make sure you import the Path class at the top of your file: from pathlib import Path. Then I added the following above the definition for the csv_nm variable.
from pathlib import Path

# Test for the two most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
p = Path.cwd()
if (p.is_dir() and p.name == 'py_play'):
    DATA_DIR = Path('./data')
elif (p.is_dir() and p.name == 'population'):
    DATA_DIR = Path('../data')
else:
    print(f"\nCurrent working directory is an unexpected location, {p}!")
    print("Please run the application from the root directory. Quitting application")
    exit(1)

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'
Oh, yes. I also had to modify the filename argument in the open() call in the get_file_lines() function to read: csv_fl = open(DATA_DIR / csv_nm, 'r'). That expression works because the Path class overloads the / operator to join path segments. The resulting path is also rendered with the separator appropriate for the operating system on which the code is run.
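To see what the / operator does on each platform without switching machines, here is a small sketch using the "pure" path classes from pathlib. These only manipulate path text and never touch the disk, so both flavours can be tried anywhere. The variable names windows_path and posix_path are mine, purely for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

csv_nm = "WPP2019_PopulationByAgeSex_Medium.csv"

# The / operator joins path segments; each flavour renders its own separator.
windows_path = PureWindowsPath("data") / csv_nm
posix_path = PurePosixPath("data") / csv_nm

print(windows_path)  # data\WPP2019_PopulationByAgeSex_Medium.csv
print(posix_path)    # data/WPP2019_PopulationByAgeSex_Medium.csv
```

Path itself picks the flavour that matches the system the code runs on, which is why the same DATA_DIR / csv_nm expression works on both.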
OK, for better or worse, that works. Though I am thinking I will add a run_app.py in the root directory at some point. That should make setting the various application paths a little easier. Who knows, the project layout may undergo more changes.
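For what it is worth, one possible alternative to checking the cwd by name is to search upward for the data folder from a known starting point. The following is only a sketch under my own assumptions: the helper find_data_dir() is hypothetical (it is not in the project), and the temporary py_play/population layout is built just so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def find_data_dir(start: Path, dir_name: str = "data") -> Path:
    """Walk upward from start until a directory named dir_name is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        data_dir = candidate / dir_name
        if data_dir.is_dir():
            return data_dir
    raise FileNotFoundError(f"No '{dir_name}' directory found at or above {start}")


# Build a throwaway py_play/population layout to try the helper out.
root = Path(tempfile.mkdtemp()) / "py_play"
(root / "data").mkdir(parents=True)
(root / "population").mkdir()

# Both starting points end up at the same data directory.
from_root = find_data_dir(root)
from_subdir = find_data_dir(root / "population")
print(from_root == from_subdir)
```

In the application itself the starting point could be Path.cwd(), or Path(__file__).parent so that the result no longer depends on where the app was launched from. I have not tried either in the project yet.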
We are now in a position to look at refactoring the current code into suitable modules/packages.
Refactor Code Into Multiple Focused Modules/Packages
Chart Module
Let’s start with the charting code. Only one function and a couple of imports to be moved. I moved the function plot_bar_chart() from py_play\population\population_by_age.py to py_play\population\chart\chart.py. Then I moved the imports for pyplot and numpy between the same two modules. In population_by_age.py I added import chart and modified the call to the plotting function in process_plot() to include the package name, chart.plot_bar_chart(...). I figured we’d be good to go. No such luck.
(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population/population_by_age.py
Please provide the country and year for which you wish a plot, or 'q' to quit.
Country: peru
4-digit year (1950-2100): 2002
Attempting to plot data for peru in the year 2002.
Traceback (most recent call last):
  File "r:/learn/py_play/population/population_by_age.py", line 107, in <module>
    process_plot(country, year, csv_nm)
  File "r:/learn/py_play/population/population_by_age.py", line 76, in process_plot
    chart.plot_bar_chart(proper_c_nm, year, pop_data)
AttributeError: module 'chart' has no attribute 'plot_bar_chart'
I then tried a variety of import statements. Googled and googled. Read and read. I still don’t fully understand how imports in my project layout really work. But I did find a solution. It all comes down to what Python sees with respect to the file structure when it is run. On my system this is what the Python path variable (sys.path) contains:
sys.path: ['r:\\learn\\py_play\\population', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\python38.zip', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\DLLs', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\lib', 'E:\\appDev\\Miniconda3\\envs\\base-3.8', 'E:\\appDev\\Miniconda3\\envs\\base-3.8\\lib\\site-packages']
The key item is that first value: r:\\learn\\py_play\\population. Because the first thing we execute is a module in that directory, Python puts that directory first in its search path. Everything else in the list is added at start-up by Python or conda. Essentially, it means Python will look for imports starting in the population directory. If it doesn’t find the name there, it will continue looking in the other entries in the path variable. A plain import chart presumably picked up the chart folder itself (as a package) rather than the chart.py module inside it, which would explain the missing plot_bar_chart attribute. So, we need to tell Python to look in the chart folder and load the chart.py module. The following finally worked: from chart import chart.
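Here is a small, self-contained sketch of the difference between the two import forms. It builds a throwaway package in a temporary folder and puts that folder at the front of sys.path, much as Python does with the script's directory. The package name demo_chart and its contents are made up for the demonstration, and I am assuming a package with an __init__.py file. It is not the project's actual chart package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create <tmp>/demo_chart/__init__.py and <tmp>/demo_chart/chart.py.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_chart"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "chart.py").write_text("def plot_bar_chart():\n    return 'plotted'\n")

# Python searches sys.path in order, so put the temporary folder first.
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()

# Importing the package alone does not expose functions defined in chart.py.
import demo_chart
package_has_function = hasattr(demo_chart, "plot_bar_chart")

# Importing the module from the package does.
from demo_chart import chart as demo_chart_module
module_result = demo_chart_module.plot_bar_chart()

print(package_has_function)  # False
print(module_result)         # plotted
```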
And, bingo, the app once again displays charts. Hopefully with the correct data. I haven’t yet really checked that very carefully.
One issue though is that VS Code keeps telling me there is an error in population_by_age.py. Specifically:
{
"resource": "/r:/learn/py_play/population/population_by_age.py",
"owner": "python",
"code": "import-error",
"severity": 8,
"message": "Unable to import 'chart'",
"source": "pylint",
"startLineNumber": 4,
"startColumn": 1,
"endLineNumber": 4,
"endColumn": 1
}
I am guessing this is because VS Code is starting the application from the py_play directory, and the path to the chart package is not correct when working from that directory. I don’t know if there is a way I can tell VS Code to ignore the error. I committed my changes.
Oh yes, while doing all that troubleshooting, I ended up with a bunch more __pycache__ folders. Git wasn’t ignoring them, so I had to modify my .gitignore. I changed __pycache__/* to a plain __pycache__ and added *.pyc for good measure. That worked. I committed this change as well (separately, of course).
Fix a Bit of Bad Logic
Earlier I had added the code to generate a path for the data folder, DATA_DIR. I then used that variable in the function get_file_lines() to open the CSV file: csv_fl = open(DATA_DIR / csv_nm, 'r'). But get_file_lines() already has a file_nm parameter. And this function is currently only called from process_plot(), which also takes a file name parameter. I should be setting the CSV file name in the call to process_plot() and letting it pass that on to the functions it calls. I figured I could also give that parameter a default value. So, I changed the definition to read: def process_plot(country, year, fl_nm=DATA_DIR / csv_nm):. And I changed the open() call in get_file_lines() to read: csv_fl = open(file_nm, 'r'). And committed my changes. Now we can carry on.
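One thing worth knowing about a default such as fl_nm=DATA_DIR / csv_nm is that Python evaluates it once, when the def statement runs, not on every call. The sketch below illustrates that with a stand-in process_plot() that only returns the path it was given. The real function reads the file and plots, and the 'other' and 'elsewhere' folder names are made up.

```python
from pathlib import Path

DATA_DIR = Path("data")
csv_nm = "WPP2019_PopulationByAgeSex_Medium.csv"


def process_plot(country, year, fl_nm=DATA_DIR / csv_nm):
    # Stand-in body: just report which file would be opened.
    return fl_nm


default_path = process_plot("Belgium", "2002")
explicit_path = process_plot("Belgium", "2002", Path("other") / "sample.csv")

# Rebinding DATA_DIR afterwards does not change the default,
# because the default was computed when the function was defined.
DATA_DIR = Path("elsewhere")
after_rebind = process_plot("Belgium", "2002")

print(default_path == after_rebind)
```

That is fine here because DATA_DIR is set before the function is defined. It would matter only if DATA_DIR were changed later on.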
Population Database Module/Package
Okay, I moved the following from population_by_age.py to database/population.py:
- import csv
- definitions for the CSV index value constants, e.g. X_NM
- definitions for the functions get_pop_data() and get_file_lines()
Then I updated population_by_age.py:
- added from database import population as pdb, using an alias as I don’t want to type the full name each time
- modified process_plot() to use the package label where appropriate (function calls and index variable)
A simple test says it works, though VS Code is still complaining about the imports from my application.
population_by_age.py
import argparse
from pathlib import Path

from chart import chart
from database import population as pdb

# Test for the two most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
p = Path.cwd()
if p.name == 'py_play':
    DATA_DIR = Path('./data')
elif p.name == 'population':
    DATA_DIR = Path('../data')
else:
    print(f"\nCurrent working directory is an unexpected location, {p}!")
    print("Please run the application from the root directory. Quitting application")
    exit(1)

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'


def process_plot(country, year, fl_nm=DATA_DIR / csv_nm):
    lines = pdb.get_file_lines(country, year, fl_nm)
    if lines:
        proper_c_nm = lines[0][pdb.X_NM]
        pop_data = pdb.get_pop_data(lines)
        chart.plot_bar_chart(proper_c_nm, year, pop_data)
    else:
        print(f"\n! No data found for '{country}' and '{year}'.\n")


parser = argparse.ArgumentParser()
# long name preceded by --, short by single -
parser.add_argument('--country', '-c', help='Country name')
# even though we need to pass it to other functions as string, set type to int, as a means of error checking
# and we know that we currently only have data for the years 1950 to 2100
# parser.add_argument('--year', '-y', type=int, help='4-digit year', choices=range(1950, 2101))
# didn't like the help/error output, so leaving choices argument out
parser.add_argument('--year', '-y', type=int, help='4-digit year')
args = parser.parse_args()

if args.country and args.year and 1950 <= args.year <= 2100:
    # rely on the default file path (DATA_DIR / csv_nm); a bare csv_nm would skip the data dir
    process_plot(args.country, str(args.year))
else:
    while True:
        print("Please provide the country and year for which you wish a plot, or 'q' to quit.")
        country = input("Country: ")
        if country.lower() == 'q':
            break
        year = input("4-digit year (1950-2100): ")
        if year.lower() == 'q':
            break
        # compare as numbers; a string comparison lets input such as '20' slip through
        if not (year.isdigit() and 1950 <= int(year) <= 2100):
            print('Year must be in the range 1950 - 2100 inclusive!')
            continue
        print(f'Attempting to plot data for {country} in the year {year}.')
        process_plot(country, year)
chart.py
import matplotlib.pyplot as plt
import numpy as np


def plot_bar_chart(country, year, pop_data):
    # define the x-labels for the chart
    x_labels = list(pop_data.keys())
    # get the y-value (bar height) for each x-label
    y_values = list(pop_data.values())
    # work out an x position for each bar, nice of numpy to help
    x_pos = np.arange(len(x_labels))
    # because of the x-label sizes, we need a largish display
    plt.figure(figsize=(15, 7.5))
    # give matplotlib.pyplot the values it needs to draw the bars
    plt.bar(x_pos, y_values, align='center', alpha=0.5)
    # tell it what the x-labels are and where to put them
    plt.xticks(x_pos, x_labels)
    # add some info regarding the axes and give the chart a title.
    plt.xlabel('Age Group')
    plt.ylabel('Population (1000s)')
    plt.title(country + ' ' + year + ': Population by Age Group')
    # generate the plot
    plt.show()
population.py
import csv

# indices to certain fields in the CSV file data
X_NM = 1
X_YR = 4
X_AGE_GRP = 6
X_TOT_POP = 11


def get_pop_data(src_data):
    age_group_data = {}
    for row in src_data:
        # save the stuff we want to our data dictionary
        age_group_data[row[X_AGE_GRP]] = float(row[X_TOT_POP])
    return age_group_data


def get_file_lines(country, year, file_nm):
    lines = []
    found_pair = False
    # the with block closes the file even if reading fails part way through
    with open(file_nm, 'r', newline='') as csv_fl:
        # csv.reader returns a reader object that iterates over the rows of the CSV file
        r = csv.reader(csv_fl, delimiter=',', quotechar='"')
        for row in r:
            # don't forget to account for user typing preferences
            found = country.lower() == row[X_NM].lower() and year == row[X_YR]
            if found:
                found_pair = True
                lines.append(row)
            # the code assumes matching rows are contiguous, so stop at the first miss after a hit
            if found_pair and not found:
                break
    return lines
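To convince myself the early exit in get_file_lines() behaves as intended, here is a small sketch of the same scanning idea run against an in-memory CSV. The sample rows, the two-column index values and the matching_rows() helper are all invented for the test. The real file uses the X_NM and X_YR indices shown above.

```python
import csv
import io

# Hypothetical column positions for this tiny sample only.
X_NM = 0
X_YR = 1

SAMPLE = """Belgium,2001,0-4,600
Belgium,2002,0-4,590
Belgium,2002,5-9,610
Peru,2002,0-4,2900
"""


def matching_rows(csv_file, country, year):
    rows = []
    for row in csv.reader(csv_file):
        found = row[X_NM].lower() == country.lower() and row[X_YR] == year
        if found:
            rows.append(row)
        elif rows:
            # first non-matching row after a run of matches: stop reading
            break
    return rows


belgium_2002 = matching_rows(io.StringIO(SAMPLE), "belgium", "2002")
chile_2002 = matching_rows(io.StringIO(SAMPLE), "Chile", "2002")

print(len(belgium_2002))  # 2
print(chile_2002)         # []
```

The early break only works if all rows for a country and year sit together in the file, which is what the original function relies on as well.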
Don’t forget to commit all these changes to version control. And that’s it for this one.