Code Fix

While getting ready for the next iteration, I noticed I hadn’t modified my plotting code to display the country name along with the year in the title. The title still reads “World” even though I am trying to plot the data for Australia in 2000. So I fixed that by adding a parameter for the country to the plot_bar_chart() function, modified the line that prints the title accordingly, and added, appropriately, args.country to the arguments in the function call. Then I committed my changes to the local repository and GitHub.

More Flexible Country and Year Selection

Bet you figured we were done with this exercise. Well, not quite. At the moment we need to repeat the three lines if we wish to plot data for another country and year combination. Or change the country and year in the function calls. The latter seems a bit inefficient and the former is WET (i.e. the opposite of DRY). So, I’d like to modify the program to accept user input at the command line and use that to generate the appropriate chart. If the data for the requested country and year are not in our CSV file, an appropriate message should be presented to the user.

I can see two approaches here. One is to add the country and year following the command to execute the program file. Something like:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Chile -y 1953

The other would be, once the program is run, to ask the user for the country and year, then present the bar chart. We could repeat that process until the user indicates they want to quit and exit the program. Something like:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
For which country: Chile
And which year: 2001
You asked for a plot of age related population for Chile in 2001.

Or we could allow for both: if the country and year are specified on the command line when the program is first executed, we provide the plot for that combination, or an error message. Otherwise we start the loop asking for country and year, provide the plot or error message, and stop when asked. I am going to use this combined approach. I will start by coding the command line options approach, then the user input loop approach. There will of course be plenty of research involved.

Command Line Options

Okay, there are a few ways to tackle command line options, but I am going to use the argparse module. I believe you will prefer it as well since, for example, it “automatically generates help and usage messages and issues errors when users give the program invalid arguments.” We will need to make both options optional. We could make both required if either one is provided. Another possibility would be to use defaults for both, but only generate the plot if at least one of the two options is provided. I am just going to make both optional, and only generate a plot if both are provided. Makes the most sense to me. Your view may differ. I will leave coding of differing options up to you.

argparse is one of the modules in the Python standard library, so we don’t have to install anything. But we still need to import the module. So, up top with the other imports, add:

import argparse

Since these options are optional we can’t use positional arguments (i.e. country always first, followed by year). We need to use option labels/names (long or short). I will configure for both long and short options. So having read the tutorial, here’s my kick at the can. Note, I have commented out the lines that plot the chart for the time being. Don’t need a chart each time I test my argument parsing code. If using VS Code you can highlight the lines you wish to comment and press Ctrl+/ or use Edit -> Toggle Line Comment. I haven’t really used the block comment edit command. Now onto the parsing code.

parser = argparse.ArgumentParser()
# long name preceded by --, short by single -
parser.add_argument('--country', '-c', help='Country name')
# even though we need to pass it to other functions as string, set type to int, as a means of error checking
# and we know that we currently only have data for the years 1950 to 2100
# parser.add_argument('--year', '-y', type=int, help='4-digit year', choices=range(1950, 2101))
# didn't like the help/error output, so leaving choices argument out
parser.add_argument('--year', '-y', type=int, help='4-digit year')

args = parser.parse_args()

If you commented out the earlier plot code, added something like the above, and ran the program file, you should see nothing in the terminal window, because both options are optional. Try adding -h at the command line: click in the terminal window (Terminal tab in the output window), hit the up arrow, and type a space followed by ‘-h’ at the end of the line. Hit enter. You should see something like the following. You may have different help messages or argument names.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -h
usage: population_by_age.py [-h] [--year YEAR] [--country COUNTRY]

optional arguments:
  -h, --help            show this help message and exit
  --year YEAR, -y YEAR  4-digit year
  --country COUNTRY, -c COUNTRY
                        Country name

Let’s add print(args) after the parse_args() line, then run the program with and without arguments, including the single argument case.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
Namespace(country=None, year=None)

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Canada     
Namespace(country='Canada', year=None)

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Canada -y 1950
Namespace(country='Canada', year=1950)

Note that if an option is not provided, argparse assigns it the value None. And, in Python, None is treated as false in boolean expressions.
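Because a missing option shows up as None, it is fairly easy to code one of the alternatives mentioned earlier, namely making both options required if either one is given. What follows is only a sketch under my own assumptions: parse_options is a hypothetical function name, and I pass the argument list in as a parameter so the function can be exercised without touching the real command line.

```python
import argparse


def parse_options(argv=None):
    """Parse -c/--country and -y/--year, rejecting a lone option.

    argv=None makes argparse read the real command line.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--country', '-c', help='Country name')
    parser.add_argument('--year', '-y', type=int, help='4-digit year')
    args = parser.parse_args(argv)
    # A missing option arrives as None, so test against None explicitly.
    if (args.country is None) != (args.year is None):
        # parser.error() prints the usage line and this message, then exits.
        parser.error('--country and --year must be given together')
    return args
```

With this variant, running with no options still returns a namespace holding two None values, so the program can go on to ask the user for input.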

Don’t forget you need to run the code manually in the terminal in order to set the command line options. I am sure this could be better done using test files, but I haven’t learned how to go about that. Or you could open an Anaconda prompt, change to your py_play directory and run the program file adding the desired options. I find it just as easy to use the terminal window in VS Code.
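On the testing point, one thing that may help: parse_args() will accept a list of strings in place of the real command line, so the parser can be checked from code. A rough sketch, assuming the parser setup is moved into a function; build_parser and check_parser are names I made up for illustration.

```python
import argparse


def build_parser():
    """Return a parser with the same two options as above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--country', '-c', help='Country name')
    parser.add_argument('--year', '-y', type=int, help='4-digit year')
    return parser


def check_parser():
    """Run the three cases tried by hand in the terminal."""
    parser = build_parser()
    # An empty list stands in for "no options given".
    no_opts = parser.parse_args([])
    assert no_opts.country is None and no_opts.year is None
    one_opt = parser.parse_args(['-c', 'Canada'])
    assert one_opt.country == 'Canada' and one_opt.year is None
    both = parser.parse_args(['--country', 'Canada', '--year', '1950'])
    assert both.country == 'Canada' and both.year == 1950
    return True
```

Calling check_parser() raises an AssertionError if any case misbehaves, which is about the simplest form of test file there is.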

So, now we need to check if we got both arguments. You can access them with args.country and args.year. And check that the year is between 1950 and 2100 inclusive. If so generate the bar chart plot. Don’t forget to convert the year to a string when passing it to the functions. Otherwise, for now provide an error message. Once you’ve got your code working, you can come back and critique my code below.

parser = argparse.ArgumentParser()
# long name preceded by --, short by single -
parser.add_argument('--country', '-c', help='Country name')
# even though we need to pass it to other functions as string, set type to int, as a means of error checking
# and we know that we currently only have data for the years 1950 to 2100
# parser.add_argument('--year', '-y', type=int, help='4-digit year', choices=range(1950, 2101))
# didn't like the help/error output, so leaving choices argument out
parser.add_argument('--year', '-y', type=int, help='4-digit year')

args = parser.parse_args()
if args.country and args.year and args.year >= 1950 and args.year <= 2100:
  str_yr = str(args.year)
  lines = get_file_lines(args.country, str_yr, csv_nm)
  pop_data = get_pop_data(args.country, str_yr, lines)
  # first argument is the country, not the year
  plot_bar_chart(args.country, str_yr, pop_data)
else:
  print('A country and year (1950-2100 inclusive) are required.')
  print('Usage: population_by_age.py -c <CountryName> -y <4-digit year>')

And, some testing results. The last produced the desired bar chart.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Canada
A country and year (1950-2100 inclusive) are required.
Usage: population_by_age.py -c <CountryName> -y <4-digit year>

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Canada -y 1949
A country and year (1950-2100 inclusive) are required.
Usage: population_by_age.py -c <CountryName> -y <4-digit year>

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py -c Australia -y 2000
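As an aside on the choices=range(1950, 2101) line I commented out: another way to push the range check into argparse itself is to give the option a custom type function. The sketch below is an alternative I have not wired into the program; year_in_range and build_parser are hypothetical names. argparse.ArgumentTypeError is the exception argparse expects a type function to raise, and it reports the message as a normal argument error.

```python
import argparse


def year_in_range(text):
    """argparse type function: convert to int and enforce 1950-2100."""
    try:
        year = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f'{text!r} is not a whole number')
    if not 1950 <= year <= 2100:
        raise argparse.ArgumentTypeError(f'{year} is outside 1950-2100')
    return year


def build_parser():
    """Parser whose --year option validates its own range."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--country', '-c', help='Country name')
    parser.add_argument('--year', '-y', type=year_in_range,
                        help='4-digit year (1950-2100)')
    return parser
```

The trade-off is that an out-of-range year now ends the program with an argparse error instead of falling through to our own else block.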

Now would be a good time to commit your changes.

git status
git add .
git commit -m "added command line parameters, optional, for country and year, both required for plot generation"
git push

Requesting User Input

Now let’s look at asking for and processing user input when no parameters, or too few, are passed on the command line when the program is run. Python, like most languages, has a built-in function just for this purpose: input([prompt]). So, if we want the country, something like country = input('Which country? ') should do the trick. Because we are going to continually prompt until the user says “no more,” I will put the query code into an endless loop and break out when told to do so by the user.

That also means that we are going to have to put the data manipulation and plotting code into its own function, so that we don’t repeat it in both the if block and the else block where we check for command line options. Keep it DRY. So, let’s do that first.

def process_plot(country, year, fl_nm):
  lines = get_file_lines(country, year, fl_nm)
  pop_data = get_pop_data(country, year, lines)
  plot_bar_chart(country, year, pop_data)

And modify the code in our command line options block, replacing the four lines with the single function call.

if args.country and args.year and args.year >= 1950 and args.year <= 2100:
  process_plot(args.country, str(args.year), csv_nm)
else:
  ...

Once you’re sure it works, commit the changes to version control and push to GitHub (yes, I am repeating myself).

Okay, now to set up the loop and get and process input. For this loop I am going to use the while statement. We could approach the when-to-quit mechanism a couple of ways. But I am just going to use an endless loop (something you generally want to avoid) and break out of it when required. Something like:

while True:
  if <told to exit>:
    break
  <otherwise do stuff>  

The <told to exit> will be that the user entered a q or Q when asked for a country or year. One thing to note is that input always returns a string representing what the user entered. If you are looking for some other type of data, you will need to make the conversion. And, the linefeed character is removed for us, no need to do a strip(). So, go ahead and give it a try. When you have it working come on back. And if you came up with a better approach than me, all the better.
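Since input() hands back a string, a range check on the year needs a conversion first. Comparing the raw strings can mislead, because Python compares strings character by character, so '20000' sorts between '1950' and '2100'. Here is a minimal sketch of the conversion; parse_year is a hypothetical helper name, and the 1950 and 2100 defaults simply mirror the range of our data.

```python
def parse_year(text, first=1950, last=2100):
    """Return the year as an int, or None if text is not a usable year."""
    try:
        year = int(text)  # int() tolerates surrounding whitespace
    except ValueError:
        return None       # e.g. 'q', 'abc' or an empty string
    return year if first <= year <= last else None


# String comparison accepts a 5-digit "year"; the numeric check does not.
print('1950' <= '20000' <= '2100')  # True
print(parse_year('20000'))          # None
```

Returning None for anything unusable keeps the calling loop simple: it only has to test for None before plotting.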

if args.country and args.year and args.year >= 1950 and args.year <= 2100:
  process_plot(args.country, str(args.year), csv_nm)
else:
  #print('A country and year (1950-2100 inclusive) are required.')
  #print('Usage: population_by_age.py -c <CountryName> -y <4-digit year>')
  while True:
    print("Please provide the country and year for which you wish a plot, or 'q' to quit.")
    country = input("Country: ")
    if country.lower() == 'q':
      break
    year = input("4-digit year (1950-2100): ")
    if year.lower() == 'q':
      break
    # input() returns a string, so convert before comparing numerically
    if not year.isdecimal() or not 1950 <= int(year) <= 2100:
      print('Year must be in the range 1950 - 2100 inclusive!')
      continue
    print(f'Attempting to plot data for {country} in the year {year}.')
    process_plot(country, year, csv_nm)

The above worked fine when I used New Zealand for 1960. But died miserably when I asked for Italy in 2020. While you figure it out, I am going to commit my user input loop changes as they do appear to work. Then we’ll sort the bug.

We are searching for lines with Italy and 2020 in them. But what if the year is not 2020, but one of the population values has 2020 in it? That’s exactly what get_file_lines('Italy', '2020', csv_nm) found at line 653862 of the CSV file:

    380,Italy,2,Medium,1980,1980.5,20-24,20,5,2057.11,2020.619,4077.729

It saved the line to the list. Checked the next line, but didn’t find 2020 in that line.

    380,Italy,2,Medium,1980,1980.5,25-29,25,5,1921.484,1925.897,3847.381

So it quit checking lines, returning a single line of data. And we ended up with a chart with a single bar for the 20-24 age group filling the chart.
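To see the problem in isolation, here is a small sketch using the two lines quoted above. It assumes the check was a plain substring test along the lines of country in line and year in line; naive_match is a hypothetical stand-in for that test, not the program’s actual code.

```python
# The two lines quoted above, as they appear in the CSV file.
line_a = '380,Italy,2,Medium,1980,1980.5,20-24,20,5,2057.11,2020.619,4077.729'
line_b = '380,Italy,2,Medium,1980,1980.5,25-29,25,5,1921.484,1925.897,3847.381'


def naive_match(country, year, line):
    """Plain substring test: the year may match anywhere in the line."""
    return country in line and year in line


print(naive_match('Italy', '2020', line_a))  # True: '2020' sits inside 2020.619
print(naive_match('Italy', '2020', line_b))  # False: so the scan stops here
```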

I decided that I would check for the year surrounded by commas, as that is how it would appear in the raw line of text from the CSV file. That is, something like ,2020,.

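def get_file_lines(country, year, file_nm):
  lines = []
  found_pair = False
  # match the year as a whole field, e.g. ',2020,', not as part of a number
  yr_fld = ',' + year + ','
  # open the file passed in as file_nm rather than relying on the global csv_nm
  with open(file_nm, 'r') as csv_fl:
    for line in csv_fl:
      found = country in line and yr_fld in line
      if found:
        found_pair = True
        lines.append(line.strip())
      # stop once we have passed the block of matching lines
      if found_pair and not found:
        break
  return lines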

Once I made those changes I was presented with a chart that looked a lot more like what I was expecting.

Closing Thoughts

Firstly, you may recall the get_pop_data() function we wrote to extract the data from the appropriate lines of the simulated CSV file had some checks. It checked to make sure it was processing the correct lines (country and year) and stopped when lines no longer contained the desired country/year. We now have a function that extracts those lines from a real CSV file. That means we are no longer passing it a list of lines that may contain undesired lines. So, those checks aren’t required and should likely be removed. But, for now I am likely just going to leave them in the function’s code.

Secondly, we stop pulling lines from the CSV once the country/year have been found and then no longer appear in a line. But we process every line before that. Given the file has roughly 1.4 million lines, that can be a huge number of lines we really don’t want to process. At this point I really don’t know how to prevent that. It might be possible to generate a CSV file that shows the starting line for each country. Then skip straight to the first line of a given country and proceed from there. It is something I will research.
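One possible direction for that research, offered only as a sketch: record the byte offset of each country’s first line rather than its line number, because a file can seek() straight to a byte offset but cannot jump to a line number without reading everything before it. The function names are hypothetical, the country is assumed to be the second field, and splitting on commas assumes no country name contains a comma.

```python
def build_country_index(file_nm, country_col=1):
    """Map each country name to the byte offset of its first line."""
    index = {}
    offset = 0
    # Binary mode, so len(raw) is the exact number of bytes on disk.
    with open(file_nm, 'rb') as csv_fl:
        for raw in csv_fl:
            fields = raw.decode('utf-8').split(',')
            if len(fields) > country_col:
                # setdefault keeps only the first offset seen per country
                index.setdefault(fields[country_col], offset)
            offset += len(raw)
    return index


def lines_from(file_nm, offset, max_lines):
    """Read up to max_lines lines starting at the given byte offset."""
    lines = []
    with open(file_nm, 'rb') as csv_fl:
        csv_fl.seek(offset)
        for _ in range(max_lines):
            raw = csv_fl.readline()
            if not raw:
                break  # reached end of file
            lines.append(raw.decode('utf-8').strip())
    return lines
```

Building the index still reads the whole file once, so it only pays off if the result is saved and reused across runs.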

And, perhaps finally, what if I wanted to plot charts comparing 2 or more countries, or limit the age groups, or both? I would be able to reuse some of the existing functions. I might need to rework get_pop_data() to only select the desired age groups rather than all of them. And I would likely need to write new plotting functions to support each possible alternative. Or is there a way to get a single function to do whatever is needed based on a configuration parameter of some sort? Or maybe pass functions as well as data as parameters to a single plotting function that will generate the plot I want. More research.

Well, not quite finally. What if the user inputs the country name in lowercase? In most computing languages, ‘canada’ is not equal to ‘Canada’. And I didn’t deal with the case where the country/year combination is not found in the CSV data file. Also, rather than enter the full United States of America, I just tried United States and 1968. The population numbers looked awfully small. So I searched the CSV file (Ctrl + F in VS Code) for United States. The first hit I got was for United States Virgin Islands, and the chart was in fact displaying that country’s data for 1968. In most programming languages, well, computers in general, sorting text has some possibly unexpected rules. In this case, uppercase letters sort ahead of lowercase letters. So United States V sorts ahead of United States o. I expect the file was programmatically sorted by the appropriate fields on each line.
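All three of those issues point toward comparing whole fields rather than searching the raw line. Below is a rough sketch using the standard library’s csv module. find_rows is a hypothetical name, and the column positions (1 for country, 4 for year) are read off the sample line shown earlier (380,Italy,2,Medium,1980,...), so treat them as assumptions about the file layout. Like the current code, it assumes the matching rows sit together in one block.

```python
import csv


def find_rows(csv_lines, country, year, country_col=1, year_col=4):
    """Return the rows whose country and year fields match exactly.

    The country comparison ignores case; an empty list means "not found".
    """
    wanted_country = country.strip().casefold()
    wanted_year = str(year)
    rows = []
    for row in csv.reader(csv_lines):
        if len(row) <= max(country_col, year_col):
            continue  # skip blank or short lines
        if (row[country_col].casefold() == wanted_country
                and row[year_col] == wanted_year):
            rows.append(row)
        elif rows:
            break  # we have passed the block of matching rows
    return rows
```

csv.reader accepts an open file as well as a list of strings, so this could be fed the CSV file directly. With exact field matching, United States no longer matches United States Virgin Islands, ‘canada’ finds Canada, and an empty result gives the program something to test before telling the user there is no data for that combination.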

So, maybe a bit more on this exercise before I call it quits.

Well for today, time to commit the bug fix and call it a day. You are practicing using version control?

More Windows Terminal

I will be adding an out of schedule short post covering how I set up a shortcut to open Windows Terminal with four pre-configured tabs. It will be available this coming Thursday.

Resources