Normal Curve — Bah Humbug!

Though I played with plotting an estimated normal PDF over our the histogram of China’s population in 2011, I should be really clear about things. There is no way in the world that a country’s population distribution matches a that of a normal distribution. And, I am also, not going to attempt to determine if any known data distribution does fit the histogram we generate below.

You typically can’t have less people born at some point in the past than there currently are in any given age group. Somewhere in the past the 0-4 age group had to have at least the same number of people as any older group does in the year you are looking at. For any given year, the lower ages may have less people in them than later ages. But that is only because there must have been a decline in birth rate in the recent past or some significant influx of new residents. E.G. the histogram for Japan in 2011, which is am guessing is due to a declining birth rate rather than an influx of immigration.

histogram of population of Japan, 2011, by age group

In general, one would expect that a given sample of 1 year olds is going to decline in number over time. The rate at which they do so will depend on their country’s health care system, political climate, their food supply, the quality of their diet, regional strife or warfare, extreme events in the area (e.g. drought, repeated hurricanes, locust infestations), etc. In the end, the vagaries of age will simply take over and ensure a steady decline in numbers.

Do note, that I am ignoring mass migrations. Normal immigration/repatriation levels will have some impact, but probably not significant in this view of things. Unless, perhaps, if combined with a declining birthrate. And, of course, some years the population values are estimates (if not most years). So, there is also going to be some statistical error in the bar heights. All get’s rather messy.

For interest’s sake, I am going to generate a histogram following an age group over time. At least I hope I am. So, I will take the 40-44 age group in Japan’s 2011 population. Go back in time to get them when they were in the 0-4 age group. Then plot their total population every 5 years thereafter. Might take me a bit of coding, but it might be fun to have a look. For the purposes of this exercise, I am starting a new module, pop_over_time.tst.py(), in my play folder.

And, here’s the result. We’ll get into the code shortly.

histogram of population, Japan, for age group 0-4 in 1971 over time

Plotting Population Over Time

Getting Data

To start with I wrote a function that called database.population.get_crs_years_one() repeatedly to get the population count for the specified country, year and age group. Incrementing the year (+5) and age group (+1) appropriately until I had 21 years of data. This went rather slowly. As I decided to look at other countries, I added a new function to database.population, get_grp_105_yrs(), which did the work with a single pass through the CSV file and terminated when it had all the data for the country in question. I added a quick test to pop_over_time.tst.py() to time the two approaches. The result shows the single pass was effectively 21 times faster. Which makes sense as in the repeated call case the file had to be read to a similar location 21 times rather than just once. So, I won’t bother showing you the code for the first approach.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/pop_over_time.test.py -c japan -t 2

Test #2: Timing Test of 2 Functions to Get Data

1 execution of get_pop_sequence_years(Japan, 1971):
        time diff: 1603054024.7100115 - 1603053979.0318797 = 45.67813181877136
        time_ns diff: 1603054024710011500 - 1603053979031879800 = 45678131700

1 execution of pdb.get_grp_105_yrs(Japan, '1971', LABELS):
        time diff: 1603054026.976006 - 1603054024.7100115 = 2.2659945487976074
        time_ns diff: 1603054026976006100 - 1603054024710011500 = 2265994600

Let’s start with the new function in database/population.py, get_grp_105_yrs(). It takes 3 parameters, the country, the first year to plot and a list of age group labels (they are not readily available in database/population.py.) You may recall we added the function get_agrp_lbls() to chart/chart.py in our earlier development. I thought about adding/moving that function to database/population.py, but for now am sticky with the status quo. Needless to say the function will have a lot in common with the one I was calling, get_crs_years_one(), in my initial function (the slow one).

def get_grp_105_yrs(cr_nm, s_yr, g_lbls, csv_path=CSV_FL):
  """ get population for grp 0-4 in specified s_yr for specified cr_nm
      then next higher group 5 years later, until all groups (0-4 ... 100+) covered
      Returns a dictionary of population counts keyed on year
  """
  p_data = {}
  fnd_1st = False
  curr_grp = 0
  curr_yr = s_yr

  csv_fl = open(csv_path, 'r')
  # csv reader returns a reader object which iterates over the lines of the csv file
  r = csv.reader(csv_fl, delimiter=',', quotechar='"')
  for row in r:  
    spnr.prn_spin()
    found = cr_nm.lower() == row[X_NM].lower() and row[X_YR] == curr_yr and row[X_AGE_GRP] == g_lbls[curr_grp]
    if found:
      if not fnd_1st:
        fnd_1st = True
      #during development, I added the group label to the dictionary key to make sure I was getting what I wanted
      #p_data[f"{curr_yr}:{g_lbls[curr_grp]}"] = float(row[X_TOT_POP])
      p_data[curr_yr] = float(row[X_TOT_POP])
      curr_grp += 1
      curr_yr = str(int(curr_yr)+5)
    else:
      if fnd_1st and (cr_nm.lower() != row[X_NM].lower() or curr_grp >= 21):
        break
      else:
        pass

  csv_fl.close()

  spnr.clr_spin()

  return p_data

# had a bit of trouble with the spinner import, we've seen this before, so modified imports at top of file 
import csv
if __name__ == '__main__':
  import spinner as spnr
else:
  from database import spinner as spnr

# and I added a new test with all the appropriate code changes elsewhere (e.eg. adding do_tst_4 = True)
  if do_tst_4:
    LABELS = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44',
              '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89',
              '90-94', '95-99', '100+']
    yr_data = get_grp_105_yrs("Japan", "1971", LABELS)
    print(yr_data)
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/database/population.py
{'1971': 9219.843, '1976': 9217.54, '1981': 9300.141, '1986': 9236.192, '1991': 9103.357, '1996': 9087.355, '2001': 9059.291, '2006': 9045.076, '2011': 9048.184, '2016': 9006.478, '2021': 8945.681, '2026': 8827.776, '2031': 8656.851, '2036': 8416.38, '2041': 8087.507, '2046': 7613.719, '2051': 6854.456, '2056': 5643.44, '2061': 3935.768, '2066': 2008.193, '2071': 637.012}

Some Set-up Required

Ok, we’ve got the data, let’s plot a histogram. First I need to do a little set up work in my file. This is a result of the way I have my working directory structured. Will see if I can fix that some time in the future. My file initially looks like this:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime

"""pop_over_time.tst.py()

  Plot a histogram of a single age group over a period of years
  Initial parameters are a given age group and year. The first year will be calculated
  from the parameters. It is assumed we will start with the group 0-4 in the calculated year,
  then get the data for the next group 5 years later.
"""

if __name__ == '__main__':
  # if running module directly, need to set system path appropriately so that Python can find local packages
  import sys
  from pathlib import Path
  file = Path(__file__).resolve()
  parent, root = file.parent, file.parents[1]
  sys.path.append(str(root))
  # Additionally remove the current file's directory from sys.path
  try:
    sys.path.remove(str(parent))
  except ValueError: # Already removed
    pass

# pylint: disable=wrong-import-position
from chart import chart
from database import population as pdb
from database import rc_names as rcn

It is my intention to allow the specification of country and the year & age group of interest from the command line. So have added the following argparse code to my local namespace block. Have set some defaults so script can run without any arguments being passed in. You will also see that I have the two basic tests discussed so far in the test configuration variables. Though we obviously haven’t yet looked at any of that code.

if __name__ == '__main__':
  import argparse

  LABELS = chart.get_agrp_lbls()

  tst_desc = ['na', "Plot Histogram of Country's Population for an Age Group Over Time", "Timing Test of 2 Functions to Get Data"]
  max_tst = len(tst_desc) - 1
  # defaults
  do_tst = 1
  do_nm = 'Germany'
  do_grp = '40-44'
  do_yr = '2011'

  # define valid arguments, check for test number, country, year and age group on command line
  parser = argparse.ArgumentParser()
  # long name preceded by --, short by single -, get integer if appropriate
  parser.add_argument('--do_test', '-t', type=int, help=f'test number (1-{max_tst})')
  parser.add_argument('--country', '-c', help='country whose data to plot')
  parser.add_argument('--init_grp', '-g', help='age group of interest')
  parser.add_argument('--grp_yr', '-y', type=int, help='year of interest for selected group')

  args = parser.parse_args()

  if args.do_test and (args.do_test >= 1 and args.do_test <= max_tst):
    do_tst = args.do_test
  if args.country:
    if (chk_nm := chk_name(args.country)) != '':
      do_nm = chk_nm
  if args.init_grp:
    if args.init_grp in LABELS:
      do_grp = args.init_grp
  if args.grp_yr:
    if args.grp_yr >= 1950 and args.grp_yr <= 2100:
      do_yr = str(args.grp_yr)

You may have noticed the funtion chk_name() in the code above. We haven’t yet written that, but because we are getting country names from the user, we need to determine whether or not they resolve to a valid country name in our CSV file. In our earlier work we checked/validated country names in the user interface code we used to get such names from the user. We aren’t going to be using those functions here, so we need a new function. It should likely go in the one of the database modules so it is available to other modules, but for now I am just going to put it in this test module. Give it a shot and then have a look at my attempt.

def chk_name(cr_nm):
  nms_fnd = rcn.find_names(cr_nm)
  # return empty string if validation fails
  nm_fnd = ''
  # if only one matching name found, we are good to go
  if len(nms_fnd) == 1:
    nm_fnd = nms_fnd[0]
  else:
    # otherwise try to find a matching name in those returned by the search
    if cr_nm.casefold() in (name.casefold() for name in nms_fnd):
      for name in nms_fnd:
        if cr_nm.casefold() == name.casefold():
          nm_fnd = name
          break
  return nm_fnd  

Plotting the Data

The first thing we need to do is convert the year and age group passed in to the appropriate starting year for the age group ‘0-4’. That is because the user may be interested in seeing the population since birth for the age group ‘40-44’ in the year ‘2011’. So, we need to start our chart for the age group ‘0-4’ in ????. Well pretty straightforward actually. We need to start in 2011 - 40 or 1971. Once we have that sorted for our input data, we can call our function to get the necessary data. And, then plot it. So, let’s get the easy part done. See you back here when you’re done coding.

# a new function in the module's scope, figured better to put in function
# than in code in test block, may be useful elsewhere
def calc_yrs_to_zero(a_grp):
  if a_grp == '100+':
    return 100
  else:
    return int(a_grp.split('-')[0])

# in the test block

  # get start year for our histogram
  # this back and forth with the int() and str() ifying of year would seem to indicate a flaw in my design
  yr_to_zero = calc_yrs_to_zero(do_grp)
  yr_0 = int(do_yr) - yr_to_zero

  # then check test number and proceed accordingly
  if do_tst == 1:
    print(f"\nTest #{do_tst}: {tst_desc[do_tst]}\n")
    print(f"Initial criteria: country is {do_nm}, specified group and year is {do_grp} in {do_yr}")
    print(f"Plot start criteria: country is {do_nm}, first group and year is '0-4' in {yr_0}")

    yr_data = pdb.get_grp_105_yrs(do_nm, str(yr_0), LABELS)
    print('\n',yr_data)

I did not put the year zero calculation in the test block, because it will likely be needed for most tests. So I left it in the higher level scope. And, using argument defaults where appropriate, I get the following output from the above code.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/pop_over_time.test.py -c Japan

Test #1: Plot Histogram of Country's Population for an Age Group Over Time

Initial criteria: country is Japan, specified group and year is 40-44 in 2011
Plot start criteria: country is Japan, first group and year is '0-4' in 1971

 {'1971': 9219.843, '1976': 9217.54, '1981': 9300.141, '1986': 9236.192, '1991': 9103.357, '1996': 9087.355, '2001': 9059.291, '2006': 9045.076, '2011': 9048.184, '2016': 9006.478, '2021': 8945.681, '2026': 8827.776, '2031': 8656.851, '2036': 8416.38, '2041': 8087.507, '2046': 7613.719, '2051': 6854.456, '2056': 5643.44, '2061': 3935.768, '2066': 2008.193, '2071': 637.012}

Now, to plot the data in a nice histogram. I unfortunately still don’t follow how the hist() function really works. And, to get it to work nicely with dates along the x-axis took me a fair bit of sorting out. I will refer you to the links in the Resources section. After much playing about, I ended up having to convert my year strings, the x-axis labels, into dates. Then using a locator and a formatter to get things displayed ever so nicely in the plot. I also had to mess with the align argument to get the bars where I wanted them. I do not really understand how that parameter works. Will need to do more reading.

Turns out there was another issue causing some of the problems. When I looked at the first plots I was generating, the years on the x-axis were not lining up correctly with the bars. Took me a bit of reading and a fair bit of thinking spread out over the best part of a day. During, what should have been a restful night’s sleep, something occurred to me. The chart I had produced was cramming the histogram into a compressed range. Here’s a plot showing the bin edges as determined by hist(). You can see how they drift away from aligning with the correct years: [1971, 1976, 1981, 1986, …, 2066, 2071]. And how the bars are confined between 1971 and 2071. The latter was a big clue.

failed or bad histogram of population, Japan, for age group 0-4 in 1971 over time showing bin edges

Not going to bother showing the code for plotting the bin edges. I am sure you know how to do that. The setting for the align parameter would affect how the compression looked. But, I recalled that the documentation said:

So my code for the histogram was using x.min() = 1971 and x.max() = 2071. But, my bars cover 5 year intervals, so I needed extra space on either side to provide room for the true width of the bars. The hist() function wouldn’t know that from the info it was getting. So, I had to provide my own value(s) for the range parameter in order to provide the necessary space for the histogram.

So, here’s my version of a hist_pop_overtime(). Sorry about the name, didn’t feel like typing out the first two words in full.

def hist_pop_overtime(fig, ax, cr_nm, p_grp, p_yr, b_lbls, wts):
  # Because we may wish to add other items to the plot, creation of subplot is left to the user, so we have parameters for fig and ax
  # cr_nm (country name), p_grp (group of interest), p_yr (year of interest for group of interest),
  # b_lbls (bin labels, in this case the years) and wts (population counts by incrementing age group for that country and year)

  # for our locator and formatter need dates  
  x_lbls = [ datetime.datetime(int(yr), 1, 1) for yr in b_lbls]
  # because I know my bin widht is consistent
  half_bin = (int(b_lbls[1]) - int(b_lbls[0])) / 2
  yr_adj = math.floor(half_bin)
  mon_adj = int((half_bin - yr_adj) * 12)
  if mon_adj > 0:
    h_range = (datetime.datetime(int(b_lbls[0]) - yr_adj - 1, mon_adj, 1), datetime.datetime(int(b_lbls[-1]) + yr_adj, mon_adj, 1))
  else:
    h_range = (datetime.datetime(int(b_lbls[0]) - yr_adj, 1, 1), datetime.datetime(int(b_lbls[-1]) + yr_adj, 1, 1))
  
  plt.xlabel('Years')
  plt.ylabel('Population (1000s)')
  plt.title(f'{cr_nm}: History for {p_grp} in {p_yr}')

  locator = mdates.YearLocator(base=5)
  ax.xaxis.set_major_locator(locator)
  ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))  

  return ax.hist(x_lbls, weights=wts, bins=len(b_lbls), align='mid', range=h_range)

# and in our test block
    # do recall we need to provide lists to our function not an iterable
    b_lbls = list(yr_data.keys())
    b_vals = list(yr_data.values())
    fig, ax = plt.subplots(figsize=(15,9))
    n, bins, patches = hist_pop_overtime(fig, ax, do_nm, do_grp, do_yr, b_lbls, b_vals)

    plt.show()

And, here’s what I get for my ouput for Japan, 40-44 in 2011. Still showing the bin edges. Note how the bars are now pretty much centred over the appropriate year. Also, note that align-'mid' works once there is enough room in the plot’s range. Haven’t yet found a way to get hist() to use years 1966 1971, …. Not sure that is really necessary given we are going to add annotations on mouse clicks.

proper histogram of population, Japan, for age group 0-4 in 1971 over time showing bin edges

Adding Annotations on Mouse Click

I also wanted to get the data for each bar displayed on a mouse click. So, I added functions for that purpose to this file as well. No harm in doing something a few times to get the hang of it. You will, perhaps, recall that in our previous code for annotations we needed the list of bars we created when generating our bar charts. That list allowed us to find the correct bar and obtain the height of the bar to include in our annotation. We don’t actually generate such a list when coding our histogram, but the histogram function does return a list of rectangle objects which will do the same job. The line of code to generate the histogram using our function looks like:

n, bins, patches = hist_pop_overtime(fig, ax, do_nm, do_grp, do_yr, b_lbls, b_vals)

If we print out the values of n, bins and patches we get something like:

n: [9219.843 9217.54  9300.141 9236.192 9103.357 9087.355 9059.291 9045.076
 9048.184 9006.478 8945.681 8827.776 8656.851 8416.38  8087.507 7613.719
 6854.456 5643.44  3935.768 2008.193  637.012]
bins: [719528.         721267.28571429 723006.57142857 724745.85714286
 726485.14285714 728224.42857143 729963.71428571 731703.
 733442.28571429 735181.57142857 736920.85714286 738660.14285714
 740399.42857143 742138.71428571 743878.         745617.28571429
 747356.57142857 749095.85714286 750835.14285714 752574.42857143
 754313.71428571 756053.        ]
patches: [<matplotlib.patches.Rectangle object at 0x000002212D540220>, <matplotlib.patches.Rectangle object at 0x000002212D540430>, ...
# and if we print the first Rectangle object, we get
rectangle object (patches[0]): Rectangle(xy=(720398, 0), width=1739.29, height=9219.84, angle=0)

And, the contains() method also works with rectangles. So, an approach similar to what we used the first time, but with a different list of objects should pretty much work. I am also sticky with the small bar check we used the previous time as well. And, I added the code to generate a picker so that we can close an annotation by clicking on it. For the latter, I used the original function in the chart package. All very much like the code we used the first time. Give it a shot and see you back here shortly.

def hist_cb_click(ax, fig, rects, plot_for, b_lbls, a_grps):
  """Create callback for click event. Annotation to stay visible until it is clicked
     See also: op_click(event)
     Need to create new annotation for each click, unlike the mouse hover effect
     Issues: var 'ax' not available to op_click(), so updating chart in callback(), not ideal!
  """

  def callback(event):
    # create new annotation each time so it stays on chart until removed with picker
    annot = ax.annotate(f"", xy=(0,0), xytext=(-5,20), textcoords="offset points", color='white', zorder=99.9,
                        bbox=dict(boxstyle="round", fc="black", ec="b", lw=2, alpha=0.8),
                        arrowprops=dict(arrowstyle="->"), picker=True)
    annot.set_visible(False)
    
    is_small_bar = False
    is_visible = annot.get_visible()
    
    if event.inaxes == ax:
      j = 0
      for rect in rects:
        cont, _ = rect.contains(event)
        if not cont:
          is_small_bar = (event.xdata >= rect.get_x()) and (event.xdata < (rect.get_x() + rect.get_width())) and (event.y <= 100)
        if cont or is_small_bar:
          b_ht = rect.get_height()
          x = rect.get_x() + rect.get_width()/2.
          y = rect.get_y() + b_ht
          annot.xy = (x,y)
          text = f"{plot_for}: {b_lbls[j]} ({a_grps[j]})\n{b_ht:.3f}"
          # percentages don't really make sense for this type of chart
          text += " (000s)"
          annot.set_text(text)
          annot.set_visible(True)
          fig.canvas.draw_idle()
          return
        j += 1
  return callback

And in our test block I added the following.

    ...
    n, bins, patches = hist_pop_overtime(fig, ax, do_nm, do_grp, do_yr, b_lbls, b_vals)
    onclick = hist_cb_click(ax, fig, patches, do_nm, b_lbls, a_grps=LABELS)
    c_id = fig.canvas.mpl_connect('button_press_event', onclick)
    p_id = fig.canvas.mpl_connect('pick_event', chart.pick_create(fig))

    plt.show()
    ...

I am not going to add another image of the plot with annotations. Please refert to the image near the top of the post, just above the Plotting Population Over Time section.

Command Line Argument Fix

I had planned to add a few images showing the age over time plot for a number of countries. But then I thought that if you are coding this as I/we go along, then you can do that yourself — selecting the countries or regions in which you are personally interested. But, when testing my plot for a variety of countries I tried new zea. And, got an error message:

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/pop_over_time.test.py -c new zea
usage: pop_over_time.test.py [-h] [--do_test DO_TEST] [--country COUNTRY] [--init_grp INIT_GRP] [--grp_yr GRP_YR]
pop_over_time.test.py: error: unrecognized arguments: zea

Turns out that unless told otherwise, argparser only accepts a single character string for any given string argument. How do you tell it to accept longer strings? See: argparse: nargs. So, I modified my code to add the nargs argument to my call to add_argument. With multi-word arguments allowed, a list of the words submitted by the user is returned. So, in our case we need to convert that list to a space separated string before doing our search of the CSV file. Something like:

...
# in our argument definition code
# allow multiple words for a country name
  parser.add_argument('--country', '-c', help='country whose data to plot', nargs='+')
...
...
# and in our agrument processing code
  if args.country:
    full_nm = ' '.join(args.country)
    chk_nm = full_nm
    if (chk_nm := chk_name(full_nm)) != '':
      do_nm = chk_nm
...
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/pop_over_time.test.py -c new zea

Test #1: Plot Histogram of Country's Population for an Age Group Over Time

Initial criteria: country is Germany, specified group and year is 40-44 in 2011
Plot start criteria: country is Germany, first group and year is '0-4' in 1971

Still not quite what I wanted. The dreaded “(and dependencies)” situation. But, the full name did the job.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/pop_over_time.test.py -c new zealand

Test #1: Plot Histogram of Country's Population for an Age Group Over Time

Initial criteria: country is New Zealand, specified group and year is 40-44 in 2011
Plot start criteria: country is New Zealand, first group and year is '0-4' in 1971

‘Til Next Time

That’s it for another one. These posts on descriptive statistics and the related coding are sure taking me longer than most of the earlier ones. Not sure why. A little longer perhaps. Maybe more complicated. And, getting older I guess — had my 70th birthday while working on this one. See you in a week.

A final note. I do believe this side trip really didn’t help much with the discussion of descriptive statistics or data distributions. But, at least it helped me get a better understanding of the hist() functions workings. I probably should just have used the bar plotting functions to generate my own histograms, but one doesn’t learn anything without trying new things. And, it did get me thinking about other things in relation to the data we have available in the CSV file.

Resources