Module Code: est_avg_age.py

As promised here’s the current state of the code in my module population/est_avg_age.py. This is the module I added in order to track (Git) my code for generating and plotting the simulations for estimating the world average age for a given year. The test module I had been using to develop the code was not being tracked. Certainly didn’t want to risk losing my work to date.

Subsequent to the post covering development of this code, I added another test/plot/experiment to facilitate testing, Plot Single Statistic, One Histogram.

A number of things had to change once I started testing. I was using a set of constants and a set of module level variables. The latter were initially set to the value of a constant. And, modified if a suitable command line argument was passed in. I was defining some optional parameters in my function definitions. This wasn’t working correctly if I later changed one of the variables used to define the optional parameters default value via command line arguments. The functions fixed the value of the default to whatever it was at the time the function was defined. Changing the value of the variable later, had no affect on what the function used. So I dropped the optional parameters for the function definitions, requiring that they be submitted each time. In the case of plot_hist() I also changed the order of the parameters.

The whole process involved a good number of variable name changes throughout the module not just in both function definitions.

I also manually added the line World,World,900.0,World to the end of the data\mod_country-continent.csv file. That way I could, for the given year, get the world average based on the data in the the data\WPP2019_PopulationByAgeSex_Medium.csv file. I have added the display of that value to my command line output.

Cat woke me up during the night following my writing the above (and making the code changes). Got thinking and realized what I had done would affect the sampling. That was because the “World” would be in the dataframe of countries and could be selected as one of the random “countries”. Definitely not what I wanted. So, I have removed the line from the CSV. Now I just add “World” to the list of countries for which I wish to get data. And proceed accordingly. I have updated the code below. And, the sample test run.

You will also likely notice I have already planned for a couple of additional potential experiments.

"""Code for estimating the world's average (median) age by random sampling.
   Tied to blog post, world_avg_age_*/*.

   This is version 2, needed to rethink how I was doing things due to the time my sample generation
   could take. It is based on play/est_avg_age.test.py

   Planning to try and do it all in memory rather than repeated i/o.

   Will be repeating code in a lot of other modules, but, no way around that while I do my testing

   Actually !! going to generate a new csv file that has country and mod continent column from the latest csv file
   plus a new valid country name column that has the proper name from the population database csv. Will
   use that csv to generate random names for simulation.

File: py_play/population/est_avg_age.py

Module level and/or global variables:
----

  _CSV_NM: database population data CSV = 'WPP2019_PopulationByAgeSex_Medium.csv'
  _LST_NM: old country/region name list/CSV = 'cr_list.txt'
  _LST_CC_NM: new country/region name + locID list/CSV = 'cr_code_list.csv'
  _CSV_COUNTRY: downloaded country, continent list/CSV = 'country-continent-codes.csv'
  _CSV_MOD_CC: new country, continent list/CSV with valid names from _LST_CC_NM = 'mod_country-continent.csv'

  _BLD_DB_CC_CSV: build or rebuild revised country/locID CSV = False
  _BLD_MOD_CSV: build or rebuild modified country-continent CSV = False
  _DO_SAMPLING: do a sampling run/test = True

  _S_SIZE: the number of countries to randomly select for each sample, default 5
  _REPEATS: the number of times to repeat our complete process (random countries, calculations, etc) before
            we look to see what we have determined -- if anything, default 100

  _LABELS: a list of the label for each bin in our country data
  _L_LOWS: a lits of the lower boundaries for the above labels (the quartile calculation function needs those)

  use_yr: the year for which to estimate average age, default 2011
  n_tst: which test to run, default = 1
  t_size: actual sample size to use for test, default = _S_Size
  t_rpt: actual number of repeitions to use for test, default = _REPEATS
  t_stat: if applicable, which sample statistic is to be displayed = 1

Functions:

  get_sample(r_nbr): get data for r_nbr countries, generate and return sample statistics
  plot_hist(ax, h_data, s_yr, s_type, c_size, s_rep): generate est avg age histogram in specified Axes using supplied data

"""
import csv
import pathlib
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import time
import argparse

# Pylance: disable=reportMissingImports
#import matplotlib.pyplot as plt
#from scipy import stats
from chart import chart
from chart import descriptive_stats as dstat
from database import population as pdb
from database import rc_names as rcn

#import math

# Test for some of the most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
D_PATH = ''
p = pathlib.Path.cwd()
if p.name == 'py_play':
  D_PATH = pathlib.Path('./data')
elif p.name == 'population':
  D_PATH = pathlib.Path('../data')
elif p.name == 'database':
  D_PATH = pathlib.Path('../../data')
else:
  print(f"\nCurrent working directory is an unexpected location, {p}!")
  print(f"Please run the application from the root directory. Quitting application")
  exit(1)

# module constants
_CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
_LST_NM = 'cr_list.txt'
_LST_CC_NM = 'cr_code_list.csv'
_CSV_COUNTRY = 'country-continent-codes.csv'
_CSV_MOD_CC = 'mod_country-continent.csv'

_BLD_DB_CC_CSV = False
_BLD_MOD_CSV = False
_DO_SAMPLING = True

_S_SIZE = 5
_REPEATS = 100

_LABELS = chart.get_agrp_lbls()
_L_LOWS = dstat.get_grp_low_bdrys(_LABELS)

# define some module level, empty/default variables

df_country = pd.DataFrame(columns = ['continent','country','valid'])
valid_country = []
c_data  = {}
n_tst = 1
t_size = _S_SIZE
t_rpt = _REPEATS
t_stat = 0
use_yr = '2011'

def get_sample(r_nbr):
  global df_country, c_data
  s_medians = {}
  mean_ms = 0
  wt_mean_ms = 0
  sum_median = 0
  c_pops = []
  sum_pop = [0 for _ in range(21)]
  c_tot = {}
  
  for _ in range(r_nbr):
    rand_r = df_country.sample()
    c_nm = rand_r['valid'].iloc[0]
    s_data = c_data[c_nm]
    c_pops.append(s_data)
    c_tot[c_nm] = sum(s_data)
    s_median = dstat.get_wt_quartile(_L_LOWS, c_data[c_nm])
    s_medians[c_nm] = s_median

  mean_ms = sum(s_medians.values()) / r_nbr
  wt_medians = [s_medians[nm] * c_tot[nm] for nm in s_medians.keys()]
  wt_mean_ms = sum(wt_medians) / sum(c_tot.values())
  sum_pop = [sum(n) for n in zip(*c_pops)]
  sum_median = dstat.get_wt_quartile(_L_LOWS, sum_pop)
 
  return (mean_ms, wt_mean_ms, sum_median)


def plot_hist(ax, h_data, s_yr, s_type, c_size, s_rep):
  ax.set_xlabel(f'Age ({s_type})')
  ax.set_title(f"Simulation: {c_size} countries X {s_rep} repetitions")
  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

  s_mean = statistics.mean(h_data)
  s_stdev = statistics.stdev(h_data)
  std_err = 1.96 * s_stdev / (s_rep**0.5)
  mean_p95 = s_mean + std_err
  mean_m95 = s_mean - std_err

  ax.axvline(s_mean, 0, 1, color='c', label=f'Est. Average Age: {s_mean:.2f}')
  ax.axvline(mean_m95, 0, 1, color='g', label=f'95% conf interval: {mean_m95:.2f}')
  ax.axvline(mean_p95, 0, 1, color='m', label=f'95% conf interval: {mean_p95:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  ax.axvline(s_stdev, 0, 0, color='k', label=f'std dev: {s_stdev:.2f}')
  ax.legend()


if __name__ == '__main__':
  tst_desc = ['na',
              "Plot Histograms of 3 Types of Sample Data",
              "Plot 4 histograms showing sample statistic for various repetitions",
              "Plot 4 histograms showing sample statistic for various sample sizes",
              # following should always be the last test
              "Plot Single Statistic, One Histogram"
  ]
  max_tst = len(tst_desc) - 1
  smpl_types =['Simple Mean of Sample Medians',
              'Weighted Mean of Sample Medians',
              'Median of Summed Sample Data'
  ]
  max_smpl = len(smpl_types) - 1
  max_size = 50
  max_rpt = 10000
  # defaults

  # define and get command line arguments
  parser = argparse.ArgumentParser()
  # long name preceded by --, short by single -, get it as an integer so can use to access test data array
  parser.add_argument('--do_exp', '-e', type=int, help=f'Which experiment to run (1-{max_tst})')
  parser.add_argument('--c_cnt', '-c', type=int, help=f'Number of countries to use per randome sample (1-{max_size})')
  parser.add_argument('--rpts', '-r', type=int, help=f'How many sample repetitions to use for current run (1-{max_rpt})')
  parser.add_argument('--stat', '-s', type=int, help=f'Which sample statistic to display (1-{max_smpl}), if applicable')
  parser.add_argument('--year', '-y', help=f'Which year to use for sample run (1950-2100)')

  args = parser.parse_args()

  if args.do_exp:
    if (args.do_exp >= 1 and args.do_exp <= max_tst):
      n_tst = args.do_exp
    else:
      _DO_SAMPLING = False
  if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
    t_size = args.c_cnt
  if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
    t_rpt = args.rpts
  if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
    t_stat = args.stat - 1
  if args.year and (int(args.year) >= 1950 and int(args.year) <= 2100):
    use_yr = args.year

  if _BLD_DB_CC_CSV:
    s_tm = time.time()
    rcn.make_db_name_id_csv()
    e_tm = time.time()
    print(f"building {_LST_CC_NM} took {e_tm - s_tm}")
  
  if _BLD_MOD_CSV:
    s_tm = time.time()
    rcn.make_mod_country_csv()
    e_tm = time.time()
    diff_tm = e_tm - s_tm
    print(f"modifying and saving dataframe took {diff_tm:.2f}")

  if _DO_SAMPLING:

    lst_means = []
    lst_wt_means = []
    lst_summed_mdns = []
    
    sd_tm = time.time()
    
    df_country = pd.read_csv(D_PATH / _CSV_MOD_CC, comment='#', usecols = ['continent','country','valid'])

    valid_c = df_country['valid'].tolist()
    valid_c.append("World")
    valid_c.sort()

    c_data = pdb.get_crs_1yr_all(valid_c, use_yr)

    ed_tm = time.time()
    diff_tm = ed_tm - sd_tm
    print(f"loading CSV and getting data from population database took {diff_tm:.2f} ({sd_tm:.6f}-{ed_tm:.6f})")

    # get world median age for current year
    w_median = dstat.get_wt_quartile(_L_LOWS, c_data['World'])

    for _ in range(t_rpt):
      mean_mds, wt_mean_mds, summed_md = get_sample(t_size)
      #print(mean_mds, wt_mean_mds, summed_md)
      lst_means.append(mean_mds)
      lst_wt_means.append(wt_mean_mds)
      lst_summed_mdns.append(summed_md)

    # simple sample stats
    mean_means = statistics.mean(lst_means)
    sd_means = statistics.stdev(lst_means)
    serr_means = 1.96 * sd_means / (t_rpt**0.5)
    means_p95 = mean_means + serr_means
    means_m95 = mean_means - serr_means

    mean_wtd = statistics.mean(lst_wt_means)
    sd_wtd = statistics.stdev(lst_wt_means)
    serr_wtd = 1.96 * sd_wtd / (t_rpt**0.5)
    wtd_p95 = mean_wtd + serr_wtd
    wtd_m95 = mean_wtd - serr_wtd

    mean_sumd = statistics.mean(lst_summed_mdns)
    sd_sumd = statistics.stdev(lst_summed_mdns)
    serr_sumd = 1.96 * sd_sumd / (t_rpt**0.5)
    sumd_p95 = mean_sumd + serr_sumd
    sumd_m95 = mean_sumd - serr_sumd

    print(f"Random sample of {t_size} countries, repeated {t_rpt} times")
    print(f"\n{'Simple mean:':14} {mean_means:.2f}, std dev: {sd_means:.2f}, 95%: {means_m95:.2f} - {means_p95:.2f}")
    print(f"{'Weighted mean:':14} {mean_wtd:.2f}, std dev: {sd_wtd:.2f}, 95%: {wtd_m95:.2f} - {wtd_p95:.2f}")
    print(f"{'Summed median:':14} {mean_sumd:.2f}, std dev: {sd_sumd:.2f}, 95%: {sumd_m95:.2f} - {sumd_p95:.2f}")

    es_tm = time.time()
    diff_tm = es_tm - ed_tm
    print(f"\ngenerating sample stats took {diff_tm:.2f} ({ed_tm:.6f}-{es_tm:.6f})")

    h_lists = [lst_means, lst_wt_means, lst_summed_mdns]

    if n_tst == 1:
      print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")

      fig, axs = plt.subplots(2, 2, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
      axs[1, 1].axis('off')
      fig.suptitle(f"Estimated World Average Age: {use_yr}")

      p_row = 0
      for i in range(len(h_lists)):
        plot_hist(axs[p_row, i % 2], h_lists[i], use_yr, smpl_types[i], t_size, t_rpt)
        if i > 0 and i % 2 == 1:
          p_row += 1

    elif n_tst == 2:
      print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")

    elif n_tst == 3:
      print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")

    elif n_tst == max_tst:
      print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")

      fig, ax = plt.subplots(figsize=(10,6))
      fig.suptitle(f"Estimated World Average Age: {use_yr}")
      plot_hist(ax, h_lists[t_stat], use_yr, smpl_types[t_stat], t_size, t_rpt)

    es_tm = time.time()
    diff_tm = es_tm - sd_tm
    print(f"whole sampling process took {diff_tm:.2f}")

    plt.show()

And, here’s a test run using all the command line parameters.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 4 -s 2 -c 10 -r 500 -y 2020
loading CSV and getting data from population database took 4.92 (1605536641.538051-1605536646.457991)
Random sample of 10 countries, repeated 500 times

Simple mean:   29.95, std dev: 3.08, 95%: 29.68 - 30.22
Weighted mean: 30.52, std dev: 4.93, 95%: 30.09 - 30.96
Summed median: 29.94, std dev: 5.04, 95%: 29.49 - 30.38
World median:  30.90

generating sample stats took 1.18 (1605536646.457991-1605536647.639950)

Test #4: Plot Single Statistic, One Histogram

whole sampling process took 6.73

image showing sample output from a test run of command line argument use