Well, this is certainly taking considerably longer than I planned. So, for now I am going to ignore the “Search” menu item. I would like to get the plotting code reworked to use the data structure (list of lists) being returned by the chart user interaction functions.

Getting Data and Plotting Chart

We previously had a small function in our main app module, specifically process_plot(country, year, fl_nm=DATA_DIR / csv_nm), that took the selection criteria we got from the user to generate the plot. That function called the appropriate database/population module functions to get the required population data. It then called a plotting function in the chart/chart module, passing in the population data, in order to get the chart plotted. I am now thinking that the chart/chart module would be a better place for that function. That way our main module, menues.py at the moment, doesn’t need to know anything about the workings of database/population. Nor, really, anything much about chart/chart. Hope you agree.

So, let’s sort that in the code for the main menu loop.

Who Does What?

First we have to decide who handles the situation where the data structure is empty or incomplete. Seems to me if this is the case, we will need to redisplay a menu or exit the application. So this very likely belongs in the module responsible for interacting with the user. For now that would be our menues.py module.

I will check the elements of the data structure returned by the do_chart_*() functions to make sure they are there (i.e. have a length of at least 1 where appropriate) and/or do not have a value of ‘X’ in the definitive location.

Oh yes, I also removed the test/debug print statements after each call to one of the do_chart_* functions. My section of the loop related to the charting sub-menu now looks like this:

    elif u_choice.upper() == 'C':
      while True:
        c_choice = do_chart_menu()
        if c_choice.upper() == 'Q':
          break
        else:
          if c_choice == '1':
            p_data = do_chart_1ma()
            p_nms, p_yrs, p_grp = p_data
          elif c_choice == '2':
            p_data = do_chart_m1a()
            p_nms, p_yrs, p_grp = p_data
          elif c_choice == '3':
            p_data = do_chart_mm1()
            p_nms, p_yrs, p_grp = p_data
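          else:
            # "\u2026" is the ellipsis character; "\u8230" is an unrelated CJK glyph
            print("What the\u2026?")
            # unknown choice, so there is nothing to unpack: redisplay the chart menu
            continue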
          # if we have user choices for all values, send to plot function
          # otherwise ???
          do_plot = len(p_nms) >= 1 and 'X' not in p_nms
          if do_plot:
            do_plot = do_plot and p_yrs[0].upper() != 'X'
          if do_plot:
            # stub code, will eventually change to call the chart module function
            print(f"\n\tPlotting chart for {p_data}")
          else:
            print(f"\n\t{TRED}Not{TRESET} provided with {TRED}sufficient parameters{TRESET} to produce a plot!")
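If the checks in the loop start to multiply, they could be pulled into one small helper. The following is only a sketch: has_plot_params() is a name I made up for illustration, and it assumes the structure is always [names, [start year, number of years], [age group]] with 'X' marking a cancelled selection.

```python
def has_plot_params(p_data):
  """Return True when the user supplied at least one name and a start year.

  Hypothetical helper. Assumes p_data looks like
  [names, [start_year, number_of_years], [age_group]].
  """
  if len(p_data) != 3:
    return False
  p_nms, p_yrs, _p_grp = p_data
  # no names at all, or the user backed out of the name selection
  if len(p_nms) < 1 or 'X' in p_nms:
    return False
  # no year entry, or the user backed out of the year selection
  if len(p_yrs) < 1 or str(p_yrs[0]).upper() == 'X':
    return False
  return True


print(has_plot_params([['Zimbabwe'], ['2005', 2], ['all']]))  # True
print(has_plot_params([['X'], ['2005', 2], ['all']]))         # False
print(has_plot_params([['Zimbabwe'], ['X'], ['all']]))        # False
```

The menu loop would then only need a single call to decide between plotting and showing the error message.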

New Data Generation and Charting Functions

The new function in the chart module is going to have to look at the data structure it receives, figure out how to handle it, call the appropriate database modules to get the necessary population related data and then use that to generate a chart for the user. We will also have to sort the titles, axis labels and values, etc.
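One possible way for that function to figure out what it has been handed is to look at the shape of the selection data. This is a rough sketch only. The name chart_type() and the rules in it are my assumptions, based on the three menu choices: one country for several years, several countries for one year, and several countries over several years for a single age group.

```python
def chart_type(p_data):
  """Guess the chart type (1, 2 or 3) from the user selection structure.

  Hypothetical helper. Assumes p_data looks like
  [names, [start_year, number_of_years], [age_group]].
  """
  names, (_start, nbr_yrs), (grp,) = p_data
  if len(names) == 1 and grp == 'all':
    return 1   # one country/region, one or more years, all age groups
  if len(names) > 1 and nbr_yrs == 1 and grp == 'all':
    return 2   # several countries/regions, one year, all age groups
  if grp != 'all':
    return 3   # countries/regions over years, one age group
  return None  # a combination the menus are not expected to produce


print(chart_type([['Zimbabwe'], ['2005', 2], ['all']]))                  # 1
print(chart_type([['Zimbabwe', 'Mozambique'], ['2010', 5], ['65-69']]))  # 3
```

Returning None for an unexpected combination leaves it to the caller to decide whether to complain or go back to the menu.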

Let’s start by sorting out how to get the necessary data to be able to produce a plot. And how to structure it for passing on to the plotting function. I expect it will be similar to the structure we are using to pass the data from the user data selection functions.

First, let’s look at potential return values for the various chart types.

Plotting chart for [['Zimbabwe'], ['2005', 2], ['all']]
Plotting chart for [['Venezuela (Bolivarian Republic of)', 'Chile'], ['2010', 1], ['all']]
Plotting chart for [['Zimbabwe', 'Mozambique'], ['2010', 5], ['65-69']]

In the first case we’d need to have data for 1 country for 2 years. So, we likely should replace the year data list with a list of a single dictionary keyed on year, with a value that is a list of all the volume data for each age group for that year for the specified country. For example:

[Image: test displaying proposed data for above case]

In the second, we’d probably do the reverse. That is, replace the country name list with a list of dictionaries keyed on country name with a value that is a list of all the volume data for each age group for that country for the specified year.

The third case does rather look a bit ugly. Combining the two approaches above as is just won’t work. So, I am thinking we can instead use a list of nested data dictionaries within the country list. The exterior dictionary would be keyed on country name. The value for each name would be a dictionary keyed on year with the value for each year being the population for the specified age group.

We may also need to consider adding a list value to the structure that specifies the chart type. Time will tell. The structure for each case would look something like the following.

# Type 1
user data: [['China'], ['2005', 2], ['all']]

chart data: [['China'], [{'2005': [83125.614, 88198.485, 99534.8, 130919.518, 102106.616, 98401.096, 122890.403, 127741.409, 104036.379, 84552.285, 86472.205, 59448.529, 43774.085, 37657.106, 29259.212, 18077.959, 9364.175, 4012.276, 1038.165, 151.163, 14.9], '2006': [81929.946, 87099.154, 94916.306, 128824.878, 107739.876, 96132.462, 118364.611, 129242.308, 109191.626, 86185.272, 87457.239, 64322.902, 45208.141, 37750.817, 29767.234, 18819.452, 9781.094, 4253.152, 1224.934, 181.07, 16.17]}], ['all']]

# Type 2
user data: [['Venezuela (Bolivarian Republic of)', 'Chile'], ['2010', 1], ['all']]

chart data: [[{'Chile': [1224.229, 1215.911, 1327.57, 1462.566, 1422.594, 1317.564, 1280.183, 1239.245, 1220.229, 1166.694, 1034.5, 884.093, 669.755, 519.24, 401.13, 310.239, 218.714, 100.888, 36.256, 9.329, 1.602], 'Venezuela (Bolivarian Republic of)': [2900.078, 2847.209, 2758.383, 2698.861, 2619.602, 2368.262, 2163.254, 1922.294, 1795.301, 1587.041, 1343.277, 1046.425, 806.459, 597.682, 421.854, 274.75, 162.923, 82.601, 32.885, 9.213, 1.588]}], ['2010', 1], ['all']]

# Type 3
user data: [['Zimbabwe', 'Mozambique'], ['2010', 5], ['65-69']]

chart data: [[{'Mozambique': {'2010': 298.583, '2011': 303.966, '2012': 310.069, '2013': 316.745, '2014': 323.74}, 'Zimbabwe': {'2010': 154.329, '2011': 151.155, '2012': 146.272, '2013': 141.601, '2014': 139.818}}], ['2010', 5], ['65-69']]
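Note that the user data only carries a starting year and a count, while the chart data is keyed on each individual year. Something has to expand one into the other. Here is a minimal sketch of that step. The function name expand_years() is hypothetical and not something already in the project.

```python
def expand_years(p_yrs):
  """Turn ['2010', 5] into ['2010', '2011', '2012', '2013', '2014'].

  Hypothetical helper. Assumes p_yrs is [start_year_as_string, number_of_years].
  The years are returned as strings, since that is how they are used as
  dictionary keys in the chart data.
  """
  start, count = p_yrs
  first = int(start)
  return [str(yr) for yr in range(first, first + count)]


print(expand_years(['2005', 2]))  # ['2005', '2006']
print(expand_years(['2010', 1]))  # ['2010']
print(expand_years(['2010', 5]))  # ['2010', '2011', '2012', '2013', '2014']
```

Because the years come out in ascending order, they also match the order in which they appear in the CSV file.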

I am also thinking that we will need to add new functions in database/population.py to retrieve the requested data from the CSV for each of the cases. Previously we were really only getting one country for one year. So the function we were using was perfectly fine. However, for the new chart types, reusing those functions would mean repeated searches of the CSV database file, often going right over an area previously searched. This should be really obvious for the one country, multiple year case. So, I want new functions that reduce the time it takes to get the data and, where possible, eliminate repeated passes through the CSV file.

Module database.population

I will begin by sorting the 3 new database/population.py functions that will replace the original functions we initially wrote. Gone, eventually, will be: get_pop_data() and get_file_lines(). We will be adding: get_1cr_years_all() for Type 1 charts, get_crs_1yr_all() for Type 2 charts and get_crs_years_one() for Type 3 charts.

I will add test code to population.py in an if __name__ == '__main__': block. In the test section I will generate data for each case using the new function, then use the old functions to get the same data and compare the results. I will add variables to control which test is run, which reduces running time and keeps the terminal window just a touch tidier.

database.population.get_1cr_years_all(cr_nm, years, csv_path=CSV_FL)

For my own peace of mind, I started by writing the function based on the old way — multiple passes through the CSV file. Then I completely re-wrote it to eliminate the multiple file passes.

Had a bit of an issue opening the CSV file when testing. Had to add some path information to the default CSV file name constant, CSV_FL. I will let you sort that yourself.

Basically, I open the file then use the csv module to traverse the file line by line. I flag when I’ve found the first matching row. For each matching row, the volume is appended to a list for the specific year in a dictionary. After finding the first matching row, a non-matching row terminates the loop and the collected data is returned. But do note that we need to check for multiple dates, which are fortunately sequential. So, I check to see if the year field in the current row is in the list of dates passed to the function. I use the field index constants we previously created when checking lines and obtaining data.

This will likely be the approach for all three functions.

Give it a try. I will let you sort what parameters you feel are required. Then check out my version below. But first here’s my test code section for this first function. I don’t tend to add functions for the test code. Though I likely could. I usually just copy, paste and modify.

if __name__ == '__main__':
  do_tst_1 = True
  do_tst_2 = False
  do_tst_3 = False
  if do_tst_1:
    # [['Zimbabwe'], ['2005', 2], ['all']]
    tst_cr = 'Zimbabwe'
    tst_yr = ['2005', '2006']
    print(f"\ntesting: get_1cr_years_all({tst_cr}, {tst_yr})")
    plt_data = get_1cr_years_all(tst_cr, tst_yr)
    print("\nplt_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in plt_data.items()))
    print("}\n")
    # get the data using first set of functions
    cmp_data = {}
    for tyr in tst_yr:
      old_fn_rslt = get_file_lines(tst_cr, tyr, CSV_FL)
      old_fn_rslt = get_pop_data(old_fn_rslt)
      cmp_data[tyr] = list(old_fn_rslt.values())
    print("comparing to results for same country and years using first set of functions")
    print("\ncmp_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in cmp_data.items()))
    print("}\n")
    print(f"plt_data == cmp_data: {str(plt_data == cmp_data)}\n")

  # flag names must match the do_tst_* variables defined at the top of the block
  if do_tst_2:
    pass

  if do_tst_3:
    pass
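For anyone who would rather not copy, paste and modify, the printing and comparing part of each test could live in a couple of small helpers. This is just a sketch. show_dict() and compare_results() are names I invented, and the sample values are made up purely to show the output format.

```python
def show_dict(label, data):
  """Print a dictionary one key per line, in the style of the test output."""
  print(f"\n{label} = {{")
  print("\n".join("'{}':\t{}".format(k, v) for k, v in data.items()))
  print("}\n")


def compare_results(plt_data, cmp_data):
  """Show both result sets and report whether they are identical."""
  show_dict("plt_data", plt_data)
  print("comparing to results for same country and years using first set of functions")
  show_dict("cmp_data", cmp_data)
  same = plt_data == cmp_data
  print(f"plt_data == cmp_data: {same}\n")
  return same


# made-up values, only to exercise the helpers
new_way = {'2005': [1.0, 2.0], '2006': [3.0, 4.0]}
old_way = {'2005': [1.0, 2.0], '2006': [3.0, 4.0]}
compare_results(new_way, old_way)
```

Each test section would then shrink to gathering the two dictionaries and making one call.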

And, my code for this first function, get_1cr_years_all():

def get_1cr_years_all(cr_nm, years, csv_path=CSV_FL):
  p_data = {}
  fnd_1st = False
  
  for yr in years:
    p_data[yr] = []

  csv_fl = open(csv_path, 'r')
  # csv reader returns a reader object which iterates over the lines of the csv file
  r = csv.reader(csv_fl, delimiter=',', quotechar='"')
  for row in r:
    # Because of how we've coded the user interface chart menu functions,
    # I expect the case of cr_nm to always match that in the row, but doesn't hurt to be safe
    found = cr_nm.lower() == row[X_NM].lower() and row[X_YR] in years
    if found:
      fnd_1st = True
      p_data[row[X_YR]].append(float(row[X_TOT_POP]))
    if fnd_1st and not found:
      break
  csv_fl.close()

  return p_data

My test output for this function looks like the following. Remember you need to run the database.population module all by itself.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/database/population.py

testing: get_1cr_years_all(Zimbabwe, ['2005', '2006'])

plt_data = {
'2005': [1771.441, 1621.882, 1600.023, 1561.421, 1298.635, 1035.271, 756.999, 576.168, 445.103, 350.969, 286.036, 203.271, 193.762, 148.494, 106.305, 69.62, 34.977, 13.528, 2.482, 0.288, 0.022]
'2006': [1805.003, 1625.782, 1578.879, 1546.96, 1313.633, 1054.357, 778.217, 582.638, 449.502, 353.462, 290.573, 206.938, 188.416, 150.159, 106.677, 70.143, 36.22, 14.38, 3.177, 0.354, 0.026]
}

comparing to results for same country and years using first set of functions

cmp_data = {
'2005': [1771.441, 1621.882, 1600.023, 1561.421, 1298.635, 1035.271, 756.999, 576.168, 445.103, 350.969, 286.036, 203.271, 193.762, 148.494, 106.305, 69.62, 34.977, 13.528, 2.482, 0.288, 0.022]
'2006': [1805.003, 1625.782, 1578.879, 1546.96, 1313.633, 1054.357, 778.217, 582.638, 449.502, 353.462, 290.573, 206.938, 188.416, 150.159, 106.677, 70.143, 36.22, 14.38, 3.177, 0.354, 0.026]
}

plt_data == cmp_data: True
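As an aside, the same single-pass idea can be tried out without the real data file. The sketch below runs the loop against a few made-up rows held in memory, and uses a with block so the file object is closed automatically. The column positions and the sample rows are assumptions for this example only. The real module uses its own X_NM, X_YR and X_TOT_POP constants.

```python
import csv
import io

# Assumed column positions, for this sketch only
X_NM, X_YR, X_AGE_GRP, X_TOT_POP = 0, 1, 2, 3

# Made-up rows standing in for the CSV file (name, year, age group, population)
SAMPLE_CSV = """\
Zambia,2005,0-4,10.0
Zimbabwe,2004,0-4,1.0
Zimbabwe,2005,0-4,2.0
Zimbabwe,2005,5-9,3.0
Zimbabwe,2006,0-4,4.0
Zimbabwe,2006,5-9,5.0
Zimbabwe,2007,0-4,6.0
"""


def get_1cr_years_all_from(csv_fl, cr_nm, years):
  """Single pass over an open CSV file: one country/region, several years."""
  p_data = {yr: [] for yr in years}
  fnd_1st = False
  for row in csv.reader(csv_fl, delimiter=',', quotechar='"'):
    found = cr_nm.lower() == row[X_NM].lower() and row[X_YR] in years
    if found:
      fnd_1st = True
      p_data[row[X_YR]].append(float(row[X_TOT_POP]))
    elif fnd_1st:
      # first non-matching row after a run of matches: we are done
      break
  return p_data


with io.StringIO(SAMPLE_CSV) as csv_fl:
  print(get_1cr_years_all_from(csv_fl, 'Zimbabwe', ['2005', '2006']))
# {'2005': [2.0, 3.0], '2006': [4.0, 5.0]}
```

With a real file, the io.StringIO() call would be replaced by something like open(csv_path, 'r', newline=''), which is the form the csv module documentation recommends.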

Lovely, on we go to the next function, get_crs_1yr_all().

database.population.get_crs_1yr_all(cr_nms, year, csv_path=CSV_FL)

As mentioned earlier, I will use one pass through the CSV file to get all the desired data. But, this time we have multiple, possibly non-sequential, country/region names rather than a series of sequential years. So, we need to modify our control procedure/process a little. And, to be safe we really need to sort the list of country/region names before we start parsing the CSV file, since the data in the file is in alphabetical order by country/region name and then in year order for each country/region. In this case, of course, the year remains the same for each test of a row. But once we have a country & year combination done we need to change the name to the next in the list and hunt for that data. Once we have gone through the list of names we can stop reading through the file.

Give it a go — don’t look at my code until you do.

I also changed how I added empty entries to the dictionary. Not sure I needed to do it the way I did, but always fun to play around.

def get_crs_1yr_all(cr_nms, year, csv_path=CSV_FL):
  p_data = {}
  # the following will always be False if we haven't yet found
  # the first row for the currently sought after country/year
  fnd_1st = False
  # sort names in ascending (alphabetical) order so can make one pass through CSV file
  cr_sort = cr_nms[:]
  cr_sort.sort()

  # track the index and name of the country/region we are currently searching for
  cr_ndx = 0
  curr_cr = cr_sort[cr_ndx]

  csv_fl = open(csv_path, 'r')
  r = csv.reader(csv_fl, delimiter=',', quotechar='"')
  for row in r:  
    found = curr_cr.lower() == row[X_NM].lower() and row[X_YR] == year
    if found:
      if not fnd_1st:
        # found first row for the current cr name and year
        # so create an empty entry in the dictionary for that country
        # and flag our success
        p_data[row[X_NM]] = []
        fnd_1st = True
      # record data whenever we find a matching row
      p_data[row[X_NM]].append(float(row[X_TOT_POP]))
    if fnd_1st and not found:
      # we previously found a matching row, but the current row no longer matches
      # increment our name list index
      cr_ndx += 1
      if cr_ndx < len(cr_sort):
        # if more names to go, update currently sought after name
        curr_cr = cr_sort[cr_ndx]
        # check whether the current row is the next sought after country and year or not
        if curr_cr.lower() == row[X_NM].lower() and row[X_YR] == year:
          p_data[row[X_NM]] = []
          p_data[row[X_NM]].append(float(row[X_TOT_POP]))
        else:  
          fnd_1st = False
      else:    
        # if have gone through cr_nms list, exit loop    
        break
  csv_fl.close()

  return p_data

And, the test results look like the following. You did change the test control variables accordingly?

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/database/population.py

testing: get_crs_1yr_all(['Chile', 'Venezuela (Bolivarian Republic of)'], 2010)

plt_data = {
'Chile':        [1224.229, 1215.911, 1327.57, 1462.566, 1422.594, 1317.564, 1280.183, 1239.245, 1220.229, 1166.694, 1034.5, 884.093, 669.755, 519.24, 401.13, 310.239, 218.714, 100.888, 36.256, 9.329, 1.602]
'Venezuela (Bolivarian Republic of)':   [2900.078, 2847.209, 2758.383, 2698.861, 2619.602, 2368.262, 2163.254, 1922.294, 1795.301, 1587.041, 1343.277, 1046.425, 806.459, 597.682, 421.854, 274.75, 162.923, 82.601, 32.885, 9.213, 1.588]
}

comparing to results for same country and years using first set of functions

cmp_data = {
'Chile':        [1224.229, 1215.911, 1327.57, 1462.566, 1422.594, 1317.564, 1280.183, 1239.245, 1220.229, 1166.694, 1034.5, 884.093, 669.755, 519.24, 401.13, 310.239, 218.714, 100.888, 36.256, 9.329, 1.602]
'Venezuela (Bolivarian Republic of)':   [2900.078, 2847.209, 2758.383, 2698.861, 2619.602, 2368.262, 2163.254, 1922.294, 1795.301, 1587.041, 1343.277, 1046.425, 806.459, 597.682, 421.854, 274.75, 162.923, 82.601, 32.885, 9.213, 1.588]
}

plt_data == cmp_data: True
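For comparison, here is a different way to do the same job. To be clear, this is not the approach used in the function above. Instead of walking a sorted list of names, it keeps the wanted names in a set and drops each one once its rows have gone by. It is only a sketch, run against made-up rows in memory with assumed column positions.

```python
import csv
import io

# Assumed column positions, for this sketch only
X_NM, X_YR, X_AGE_GRP, X_TOT_POP = 0, 1, 2, 3

# Made-up rows standing in for the CSV file (name, year, age group, population)
SAMPLE_CSV = """\
Chile,2009,0-4,1.0
Chile,2010,0-4,2.0
Chile,2010,5-9,3.0
China,2010,0-4,9.0
Venezuela (Bolivarian Republic of),2010,0-4,4.0
Venezuela (Bolivarian Republic of),2010,5-9,5.0
Zimbabwe,2010,0-4,8.0
"""


def get_crs_1yr_all_from(csv_fl, cr_nms, year):
  """Single pass: several countries/regions, one year, set-based lookup."""
  wanted = {nm.lower() for nm in cr_nms}
  p_data = {}
  curr_nm = None  # name whose rows are currently being collected
  for row in csv.reader(csv_fl, delimiter=',', quotechar='"'):
    if row[X_NM].lower() in wanted and row[X_YR] == year:
      if curr_nm is not None and curr_nm != row[X_NM]:
        # moved straight from one wanted name to the next
        wanted.discard(curr_nm.lower())
      curr_nm = row[X_NM]
      p_data.setdefault(curr_nm, []).append(float(row[X_TOT_POP]))
    elif curr_nm is not None:
      # just left a run of matching rows, so that name is finished
      wanted.discard(curr_nm.lower())
      curr_nm = None
      if not wanted:
        break
  return p_data


with io.StringIO(SAMPLE_CSV) as csv_fl:
  names = ['Venezuela (Bolivarian Republic of)', 'Chile']
  print(get_crs_1yr_all_from(csv_fl, names, '2010'))
# {'Chile': [2.0, 3.0], 'Venezuela (Bolivarian Republic of)': [4.0, 5.0]}
```

The trade-off, as far as I can tell, is that this version does not need the names sorted, but it still relies on all the rows for one name and year sitting together in the file.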

database.population.get_crs_years_one(cr_nms, years, a_grp, csv_path=CSV_FL)

And now on to the last function, which I thought would be the most challenging and the most fun.

We really need to account for:

  • reaching the end of the country/region names list
  • getting the last year for a given country/region name (time to go to next name)
  • getting only one population value for the specific age group

Well we’ve pretty much covered all of that in the previous two functions. Though, we may need to do our checks in a different order or at a different spot in our file reading loop.

Give it a shot, including the test code. Then come back when you are ready.

First of all, here’s my full test code for all three cases. Only the last case is enabled. In that last test, you did remember to pull out only the requested age group when using the old functions?

if __name__ == '__main__':
  do_tst_1 = False
  do_tst_2 = False
  do_tst_3 = True

  if do_tst_1:
    # [['Zimbabwe'], ['2005', 2], ['all']]
    tst_cr = 'Zimbabwe'
    tst_yr = ['2005', '2006']
    print(f"\ntesting: get_1cr_years_all({tst_cr}, {tst_yr})")
    plt_data = get_1cr_years_all(tst_cr, tst_yr)
    print("\nplt_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in plt_data.items()))
    print("}\n")
    # get the data using first set of functions
    cmp_data = {}
    for tyr in tst_yr:
      old_fn_rslt = get_file_lines(tst_cr, tyr, CSV_FL)
      old_fn_rslt = get_pop_data(old_fn_rslt)
      cmp_data[tyr] = list(old_fn_rslt.values())
    print("comparing to results for same country and years using first set of functions")
    print("\ncmp_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in cmp_data.items()))
    print("}\n")
    print(f"plt_data == cmp_data: {str(plt_data == cmp_data)}\n")

  if do_tst_2:
    # [['Venezuela (Bolivarian Republic of)', 'Chile'], ['2010', 1], ['all']]
    tst_cr = ['Venezuela (Bolivarian Republic of)', 'Chile']
    tst_cr.sort()
    tst_yr = '2010'
    print(f"\ntesting: get_crs_1yr_all({tst_cr}, {tst_yr})")
    plt_data = get_crs_1yr_all(tst_cr, tst_yr)
    print("\nplt_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in plt_data.items()))
    print("}\n")
    # get the data using first set of functions
    cmp_data = {}
    for tcr in tst_cr:
      old_fn_rslt = get_file_lines(tcr, tst_yr, CSV_FL)
      old_fn_rslt = get_pop_data(old_fn_rslt)
      cmp_data[tcr] = list(old_fn_rslt.values())
    print("comparing to results for same country and years using first set of functions")
    print("\ncmp_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in cmp_data.items()))
    print("}\n")
    print(f"plt_data == cmp_data: {str(plt_data == cmp_data)}\n")

  if do_tst_3:
    # [['Zimbabwe', 'Mozambique'], ['2010', 5], ['65-69']]
    tst_cr = ['Venezuela (Bolivarian Republic of)', 'Chile']
    tst_cr.sort()
    tst_yr = ['2010', '2011', '2012', '2013', '2014']
    tst_grp = '65-69'
    print(f"\ntesting: get_crs_years_one({tst_cr}, {tst_yr}, tst_grp)")
    plt_data = get_crs_years_one(tst_cr, tst_yr, tst_grp)
    print("\nplt_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in plt_data.items()))
    print("}\n")
    # get the data using first set of functions
    cmp_data = {}
    for tcr in tst_cr:
      cmp_data[tcr] = {}
    for tcr in tst_cr:
      for tyr in tst_yr:
        old_fn_rslt = get_file_lines(tcr, tyr, CSV_FL)
        old_fn_rslt = get_pop_data(old_fn_rslt)
        cmp_data[tcr][tyr] = old_fn_rslt[tst_grp]
    print("comparing to results for same country and years using first set of functions")
    print("\ncmp_data = {")
    print("\n".join("'{}':\t{}".format(k, v) for k, v in cmp_data.items()))
    print("}\n")
    print(f"plt_data == cmp_data: {str(plt_data == cmp_data)}\n")

And, my function looks like this:

def get_crs_years_one(cr_nms, years, a_grp, csv_path=CSV_FL):
  p_data = {}
  nbr_crs = len(cr_nms)
  fnd_1st = False
  # sort names in ascending (alphabetical) order so can make one pass through CSV file
  # Should possibly also sort years list, but it is expected to be in the correct order
  cr_sort = cr_nms[:]
  cr_sort.sort()
  cr_ndx = 0
  curr_cr = cr_sort[cr_ndx]
  
  csv_fl = open(csv_path, 'r')
  # csv reader returns a reader object which iterates over the lines of the csv file
  r = csv.reader(csv_fl, delimiter=',', quotechar='"')
  for row in r:  
    found = curr_cr.lower() == row[X_NM].lower() and row[X_YR] in years and row[X_AGE_GRP] == a_grp
    if found:
      if not fnd_1st:
        p_data[row[X_NM]] = {}
        fnd_1st = True
      p_data[row[X_NM]][row[X_YR]] = float(row[X_TOT_POP])
      # have we processed the last year for this country
      if row[X_YR] == years[-1]:
        # have last year for current country/region, move to next name
        # or if list done stop reading the file
        cr_ndx += 1
        if cr_ndx < nbr_crs:
          curr_cr = cr_sort[cr_ndx]
          if curr_cr.lower() == row[X_NM].lower() and row[X_YR] in years and row[X_AGE_GRP] == a_grp:
            p_data[row[X_NM]] = {}
            p_data[row[X_NM]][row[X_YR]] = float(row[X_TOT_POP])
          else:  
            fnd_1st = False
        else:    
          # if have gone through cr_nms list, exit loop    
          break
  csv_fl.close()

  return p_data

And the test results:

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/database/population.py

testing: get_crs_years_one(['Chile', 'Venezuela (Bolivarian Republic of)'], ['2010', '2011', '2012', '2013', '2014'], tst_grp)

plt_data = {
'Chile':        {'2010': 519.24, '2011': 536.287, '2012': 554.583, '2013': 575.428, '2014': 600.82}
'Venezuela (Bolivarian Republic of)':   {'2010': 597.682, '2011': 621.569, '2012': 648.274, '2013': 676.335, '2014': 703.57}
}

comparing to results for same country and years using first set of functions

cmp_data = {
'Chile':        {'2010': 519.24, '2011': 536.287, '2012': 554.583, '2013': 575.428, '2014': 600.82}
'Venezuela (Bolivarian Republic of)':   {'2010': 597.682, '2011': 621.569, '2012': 648.274, '2013': 676.335, '2014': 703.57}
}

plt_data == cmp_data: True

Spinner: While You Wait

Before I call it a posting’s worth, I thought I’d use that spinner module. I found the new data retrieval functions worked quite quickly. But, code is written, may as well use it. So in each function I will call the spinner function at the top of the CSV file reading loop. Then remove it from the terminal window when exiting the file read loop. I won’t bother showing the updated code, but here’s a video of it working.

And, of course, I did not add the spinner to the old functions or to the sections of the test code calling the old functions.

Also, do note that Pylint in VS Code has some issues finding imports in my project file structure. So it complained “Unable to import ‘spinner’” when I added import spinner as spnr to the database/population module. But when I ran the module, everything worked just fine.

A reminder that we don’t have to understand everything all at once. We just need to find ways to make it work while we learn enough to understand why certain things are happening in our coding environment.

See You Next Time

In the next post, semi-weekly schedule, we will tackle the changes to the chart module/package. Until then have fun coding.

Postscript

I decided when reviewing the draft prior to posting to show the code I used to sort my CSV file path issue mentioned in the database.population.get_1cr_years_all() section above.

import matplotlib.pyplot as plt
import numpy as np

# the following is to make sure I can open the csv file when running the module independently
if __name__ == '__main__':
  import sys
  from pathlib import Path
  file = Path(__file__).resolve()
  parent, root = file.parent, file.parents[1]
  sys.path.append(str(root))
  # Additionally remove the current file's directory from sys.path
  try:
    sys.path.remove(str(parent))
  except ValueError: # Already removed
    pass
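As for the CSV file name constant itself, one way to make it independent of the folder the module is run from is to build it from the module’s own location. The following is a sketch under a couple of loud assumptions: csv_path_for() is a hypothetical helper, 'population.csv' is a placeholder file name, and the CSV file is assumed to sit in the same folder as the module. Adjust to wherever your data file really lives.

```python
from pathlib import Path


def csv_path_for(module_file, csv_name):
  """Return the path of csv_name in the same folder as module_file.

  Hypothetical helper. Assumes the CSV file lives beside the module.
  """
  return Path(module_file).resolve().parent / csv_name


# In database/population.py this might be used as (placeholder file name):
#   CSV_FL = csv_path_for(__file__, 'population.csv')
print(csv_path_for('population/database/population.py', 'population.csv'))
```

Because the path is resolved to an absolute one, it should work the same whether the module is run on its own or imported by the main app.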