Followup Thoughts From Last Post

After a bit more thinking on the matter, I have decided the bit of virtually duplicate code in plot_m1a() and plot_1ma() is really not a big deal. So, I am not going to write a new function to convert the population counts to percentages. If it becomes a big deal, I can always change my mind.

And, I will, a little later, be moving the ‘percent toggle’ menu item to the charting sub-menu. Really just makes sense. But, that means we will also have to move the code to sync the duplicate variable in chart/chart.py.

Percentage Type 3 Plots

But, I will definitely write a function to generate the data currently retrieved in the Type 3 block in plot_population(). If I am not happy with what is required in that function, I may write a new function in database/population.py to help get the necessary annual population totals. Let’s see what the new function, get_t3_data(), in chart/chart.py will look like.

Okay, the data structure used for Type 3 charts is a bit different from the other two chart types. And, it does not currently contain the data needed to produce annual population totals, without which we can’t calculate percentages for the age group being displayed. We will need to get annual totals for each of the countries/regions and years being displayed in the chart. So, let’s have a look at the simplest approach: repeated calls to get_1cr_years_all().

The Type 3 data is in a nested dictionary. The outermost dictionary is keyed on country/region name and the innermost on year. So something like data[cr_name][p_year] should give us the population count for the specified country and year. Dividing that count by the total population for the same name and year will give us the percentage for that country and year. For a given country we can get the population for a range of years using get_1cr_years_all(). So a nested loop should get the whole nested data dictionary converted: an outer loop over the countries/regions in the outer dictionary keys, and an inner loop over the years/keys in the dictionary returned by get_1cr_years_all().

Something like:

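def get_t3_data(t3_nms, t3_yrs, t3_grp):
  """Return Type 3 chart data, as percentages of annual totals if do_percent is set."""
  # counts for the one age group, keyed by country/region name, then by year
  t3_data = pdb.get_crs_years_one(t3_nms, t3_yrs, t3_grp)

  if do_percent:
    for t3_cr in t3_nms:
      # data for all age groups for this country/region, keyed by year
      curr_cr = pdb.get_1cr_years_all(t3_cr, t3_yrs)
      for t3_yr, data in curr_cr.items():
        # annual total is the sum over all the age groups
        tmp_tot = sum(data)
        t3_data[t3_cr][t3_yr] = t3_data[t3_cr][t3_yr] / tmp_tot * 100

  return t3_data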

Looks like it would actually have been small enough to include in the appropriate code block in plot_population() after all. But, with the function done, I may as well use it.

    # get data for combination of each name in p_nms and each year specified by p_yrs
    #dbg_data = pdb.get_crs_years_one(p_nms, yr_list, p_grp[0])
    dbg_data = get_t3_data(p_nms, yr_list, p_grp[0])
    plot_data[0].append(dbg_data)
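To make the conversion concrete, here is a minimal, self-contained sketch of the same nested-loop idea using made-up numbers. The two dictionaries stand in for what the database functions return; their exact shapes, the values, and the name to_percent() are assumptions for illustration only, not the project's code. Unlike get_t3_data(), this version builds a new dictionary rather than updating in place, and guards against a zero total.

```python
# Toy stand-in: counts for ONE age group, keyed by name, then by year.
# (Hypothetical values, in 1000s.)
t3_data = {
    "Canada": {"2015": 1800.0, "2016": 1900.0},
    "Germany": {"2015": 4000.0, "2016": 4100.0},
}

# Toy stand-in: counts for ALL age groups (only four here, for brevity),
# keyed the same way. Summing a list gives that year's total.
all_groups = {
    "Canada": {
        "2015": [9000.0, 9000.0, 1800.0, 16200.0],
        "2016": [9100.0, 9000.0, 1900.0, 18000.0],
    },
    "Germany": {
        "2015": [20000.0, 16000.0, 4000.0, 10000.0],
        "2016": [20000.0, 16000.0, 4100.0, 10900.0],
    },
}


def to_percent(one_group, every_group):
    """Return a new nested dict with each count as a % of that year's total."""
    pct = {}
    for cr_name, by_year in one_group.items():
        pct[cr_name] = {}
        for year, count in by_year.items():
            total = sum(every_group[cr_name][year])
            # Avoid dividing by zero if a year has no data at all.
            pct[cr_name][year] = count / total * 100 if total else 0.0
    return pct


pct = to_percent(t3_data, all_groups)
for cr_name, by_year in pct.items():
    for year, value in by_year.items():
        print(f"{cr_name} {year}: {value:.2f}%")
```

With these toy numbers the Canadian values come out at about 5% and the German ones at about 8%.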

And, let’s make sure the correct y-axis label is used in the chart displayed.

  # Add some text for labels, title and custom x-axis tick labels, etc.
  if do_percent:
    ax.set_ylabel('Population (% Annual Total)')
  else:
    ax.set_ylabel('Population (1000s)')
  ax.set_xlabel('Years')

That proved rather painless.

But it did point out some issues with the testing for country/region names that I currently employ. United Kingdom really should have been accepted, and results obtained. Will need to sort at some point.

        Please select a chart type:
                1: One country/region, 1-5 years, all age groups
                2: 2-5 countries/regions, 1 year, all age groups
                3: 1-5 countries/regions, 1 - 10 years, one age group
                Q: Return to main menu
        Your selection: 3

        1-5 countries/regions, 1 - 10 years, one age group

Please enter up to 5 country/region names when prompted.
        Enter 'D' as the name to indicate you are done if entering less than 5 names.
                Please enter a country/region name: Canada
                More than one country/region matched the name you entered: United Kingdom!
                More than one country/region matched the name you entered: United States of America!
                Please enter a country/region name: Germany
                More than one country/region matched the name you entered: France!
                Please enter a country/region name: d
                Please enter a year (1950-2100): 2015
                Please enter the number of years you wish plotted (1-10): 5

        Please enter the age for which you wish population data plotted (0 - 100): 67

        Plotting chart for [['Canada', 'Germany'], ['2015', 5], ['65-69']]

Better Performance?

But first, the new function we wrote makes a pass through the CSV file for every country/region in the chart. I decided to see if using a new function to get the annual totals in one pass would be significantly faster. So, I added get_annual_tot(cr_nms, years, csv_path=CSV_FL) to database/population.py. I thought about modifying population.get_crs_years_one() but felt that would make the function a little too complicated. Then I wrote another function, get_t3_data_v2(), using the new function to duplicate the data returned by get_t3_data(). Then I added a test to chart.py to time 10 iterations of each function using the same data. Turned out the second approach was actually a touch slower. For the life of me I don’t know why. But I don’t really care enough to figure it out.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/chart/chart.py -t 5
10 iterations of get_t3_data() took: 1882.20141 msec each
10 iterations of get_t3_data_2() took: 1896.79999 msec each

So, I am going to stick with the first function, get_t3_data(). Though I will leave all the new functions in my code and commit them to version control. Just in case.
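For anyone wanting to run a similar comparison, a timing test along these lines can be sketched with time.perf_counter() from the standard library. This is not the test code in chart.py; time_calls() and the two stand-in workloads are hypothetical names made up for the example, and the real functions and arguments would be swapped in.

```python
import time


def time_calls(func, *args, iterations=10):
    """Return the mean wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        func(*args)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000


# Stand-in workloads (hypothetical); replace with the functions to compare.
def approach_one(n):
    return sum(i * i for i in range(n))


def approach_two(n):
    return sum([i * i for i in range(n)])


for fn in (approach_one, approach_two):
    msec = time_calls(fn, 100_000)
    print(f"10 iterations of {fn.__name__}() took: {msec:.5f} msec each")
```

A gap of under 1% between two means taken over only 10 runs may well be within run-to-run noise, so repeating the whole measurement a few times (or using the standard timeit module) would give a better idea of whether the difference is real.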

Toggling Percentage Charts

Okay, time to move this menu item from the main menu to the charting sub-menu. And, sort syncing the copy. I had to add a little extra code to prevent the sub-menu from trying to plot charts when the toggle item was selected — the two p_nms = [] lines.

PS R:\learn\py_play> git diff HEAD
diff --git a/population/population_by_age.py b/population/population_by_age.py
index d9def57..3893372 100644
--- a/population/population_by_age.py
+++ b/population/population_by_age.py
@@ -55,13 +55,13 @@ do_percent = False

 MENU_M = {
   'A': 'About',
-  '%': 'Toggle percentage plots',
   'C': 'Plot chart',
   'S': 'Search country/region names',
   'X': 'Exit the application'
 }

 MENU_C = {
+  '%': 'Toggle percentage plots',
   '1': 'One country/region, 1-5 years, all age groups',
   '2': '2-5 countries/regions, 1 year, all age groups',
   '3': '1-5 countries/regions, 1 - 10 years, one age group',
@@ -124,7 +124,10 @@ def do_chart_menu():
     else:
       print(f"\n\tPlease select a chart type:")
     for key, value in MENU_C.items():
-      print(f"\t\t{CMITEM}{key}{TRESET}: {value}")
+      if key == '%':
+        print(f"\t\t{CMITEM}{key}{TRESET}: {value} {TRED+'Off'+TRESET if do_percent else TGREEN+'On'+TRESET}")
+      else:
+        print(f"\t\t{CMITEM}{key}{TRESET}: {value}")
     user_choice = input("\tYour selection: ").upper()
     choice_ok = user_choice in MENU_C.keys()
     add_valid = not choice_ok
@@ -432,10 +435,7 @@ def do_main_menu(print_ttl=False):
     else:
       print(f"\n\tPlease make a selection:")
     for key, value in MENU_M.items():
-      if key == '%':
-        print(f"\t\t{CMITEM}{key}{TRESET}: {value} {TRED+'Off'+TRESET if do_percent else TGREEN+'On'+TRESET}")
-      else:
-        print(f"\t\t{CMITEM}{key}{TRESET}: {value}")
+      print(f"\t\t{CMITEM}{key}{TRESET}: {value}")
     user_choice = input("\tYour selection: ").upper()
     choice_ok = user_choice in MENU_M.keys()
     add_valid = not choice_ok
@@ -444,6 +444,7 @@ def do_main_menu(print_ttl=False):
 if __name__ == '__main__':
   do_ttl = True
   while True:
+    p_nms = []
     u_choice = do_main_menu(print_ttl=do_ttl)
     do_ttl = False
     if u_choice.upper() == 'X':
@@ -452,13 +453,9 @@ if __name__ == '__main__':
     elif u_choice.upper() == 'A':
       disp_about()

-    elif u_choice.upper() == '%':
-      do_percent = not do_percent
-
     elif u_choice.upper() == 'C':
-      chart.do_percent = do_percent
       while True:
         c_choice = do_chart_menu()
         if c_choice.upper() == 'Q':
@@ -473,6 +470,10 @@ if __name__ == '__main__':
           elif c_choice == '3':
             p_data = do_chart_mm1()
             p_nms, p_yrs, p_grp = p_data
+          elif c_choice == '%':
+            do_percent = not do_percent
+            chart.do_percent = do_percent
+            p_nms = []
           else:
             print(f"What the\u8230?")
           # if we have user choices for all values, send to plot function
PS R:\learn\py_play>
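Stripped of the menu code, the toggle handling in the diff boils down to something like the sketch below. The two SimpleNamespace objects are stand-ins for the main script and the imported chart module, handle_chart_choice() is a made-up name, and the list of names is a placeholder for the user's answers; none of this is the actual application code.

```python
from types import SimpleNamespace

# Stand-ins (assumptions): the main script's flag and the chart module's copy.
app = SimpleNamespace(do_percent=False)
chart = SimpleNamespace(do_percent=False)


def handle_chart_choice(c_choice):
    """Return the names to plot; an empty list means nothing gets plotted."""
    p_nms = []
    if c_choice == '%':
        # Flip the flag and keep the duplicate in the chart module in sync.
        app.do_percent = not app.do_percent
        chart.do_percent = app.do_percent
    elif c_choice == '3':
        p_nms = ['Canada', 'Germany']  # placeholder for the user's answers
    return p_nms


for choice in ('%', '3', '%'):
    names = handle_chart_choice(choice)
    if names:
        print(f"plotting {names} (percent={chart.do_percent})")
    else:
        print(f"nothing to plot (percent={chart.do_percent})")
```

Because the toggle branch leaves the list of names empty, the "do we have user choices for all values" check that follows fails and no chart is attempted.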

Name Search/Verification

Back to the problem I mentioned above regarding names like United Kingdom or United States of America. In get_cr_name() I use database.rc_names.find_names() to check if the name provided by the user is in the file data/cr_list.txt. That is the file with the list of all the country/region names in the population data CSV file, one per line. The function find_names() returns a list of all the names that match the user submitted value. Here, match means the user submitted characters match the characters at the start of the country/region name in the list.

I then check the returned value to see how many names were returned. If only one, I accept the returned name. If more than one, I reject the user provided value. If an empty list is returned, I advise the user that neither their entry nor anything like it was found.

The problem is that for the names above, the function returns at least two names. For example, United Kingdom and United Kingdom (and dependencies). Ditto for France and United States of America. Perhaps more, but at least two. So, my code rejects the user submitted name text.
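The behaviour is easy to reproduce with a rough stand-in for the search. The function below is only a guess at how a case-insensitive prefix match might be written, not the real find_names(), and the short list of names is for illustration; the real list is read from data/cr_list.txt.

```python
# A handful of names for illustration only.
CR_NAMES = [
    "Canada",
    "Germany",
    "United Kingdom",
    "United Kingdom (and dependencies)",
]


def find_names_sketch(partial, names=CR_NAMES):
    """Return every name that starts with the text entered, ignoring case."""
    wanted = partial.casefold()
    return [name for name in names if name.casefold().startswith(wanted)]


print(find_names_sketch("germ"))            # one match, so it is accepted
print(find_names_sketch("United Kingdom"))  # two matches, so it is rejected
print(find_names_sketch("Atlantis"))        # no matches at all
```

Even a complete, correctly spelled name is rejected whenever it is also the start of a longer name in the list.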

The solution I propose is to see if the user supplied name exactly matches (lowercase) any of the entries in the returned list of matching names. If so, I will accept the matching name returned by the search function — which will be properly cased.

A generator expression is used to help keep the ‘in’ check memory efficient. But, the check only tells us whether there is a match; it does not return the matching item. So if it finds the desired match, a second loop is required to get the properly cased name. I am also using casefold() instead of lower() as the former is better able to handle language variations. The relevant section of the function now looks like this:

    # search for name in list of countries/regions
    nms_fnd = rcn.find_names(cr_nm)
    #print(nms_fnd)
    # if only one matching name found, we are good to go
    if len(nms_fnd) == 1:
      cr_nm = nms_fnd[0]
      is_done = True
    else:
      if cr_nm.casefold() in (name.casefold() for name in nms_fnd):
        for name in nms_fnd:
          if cr_nm.casefold() == name.casefold():
            cr_nm = name
            is_done = True
            break
      elif len(nms_fnd) > 1:
        print(f"{ONELU}{CLREOF}\t\tMore than one country/region matched the name you entered: {TRED}{cr_nm}{TRESET}!")
      else:
        print(f"{ONELU}{CLREOF}\t\tUnable to find {TRED}{cr_nm}{TRESET} in the database!")

That seems to work the way I would like it to. At least with some simple testing.
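As an aside, the membership test and the follow-up loop could probably be collapsed into a single pass by handing the generator expression to the built-in next() with a default. This is just one possible alternative, not what the function above does, and pick_exact() is a made-up name. The last two lines show the sort of difference between lower() and casefold() mentioned earlier.

```python
def pick_exact(entered, candidates):
    """Return the properly cased candidate equal to `entered` (ignoring case), else None."""
    wanted = entered.casefold()
    # next() stops at the first match and returns the default if there is none.
    return next((name for name in candidates if name.casefold() == wanted), None)


matches = ["United Kingdom", "United Kingdom (and dependencies)"]
print(pick_exact("united kingdom", matches))  # United Kingdom
print(pick_exact("united", matches))          # None

# casefold() is more aggressive than lower(): the German sharp s is expanded.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

A return value of None would then mean falling through to the existing "more than one" or "unable to find" messages.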

[Image: terminal window showing name verification working properly]

I won’t bother including an image of the resulting chart. It looked correct.

This One Done

And I think that’s about it for this one. Not sure where we will go next. But, will likely need to start a new project. Or do something different with this one. Will have to do some serious thinking.

So, will merge this branch with the master and call it a day.

And, for now, back to a weekly posting schedule. Next one a week from now.

Until then…

Resources

Python string method: casefold()
PEP 289 – Generator Expressions
Case insensitive ‘in’
How to check if a string is in a list of strings, ignoring case, in Python

Too Old To Code

Percentage Population Plots: Continued