As promised here’s the current state of the code in my module population/est_avg_age.py. This is the module I added in order to track (Git) my code for generating and plotting the simulations for estimating the world average age for a given year. The test module I had been using to develop the code was not being tracked. Certainly didn’t want to risk losing my work to date.
Subsequent to the post covering development of this code, I added another test/plot/experiment to facilitate testing, Plot Single Statistic, One Histogram.
A number of things had to change once I started testing. I was using a set of constants and a set of module level variables. The latter were initially set to the value of a constant. And, modified if a suitable command line argument was passed in. I was defining some optional parameters in my function definitions. This wasn’t working correctly if I later changed one of the variables used to define the optional parameters default value via command line arguments. The functions fixed the value of the default to whatever it was at the time the function was defined. Changing the value of the variable later, had no affect on what the function used. So I dropped the optional parameters for the function definitions, requiring that they be submitted each time. In the case of plot_hist() I also changed the order of the parameters.
The whole process involved a good number of variable name changes throughout the module not just in both function definitions.
I also manually added the line World,World,900.0,World to the end of the data\mod_country-continent.csv file. That way I could, for the given year, get the world average based on the data in the the data\WPP2019_PopulationByAgeSex_Medium.csv file. I have added the display of that value to my command line output.
Cat woke me up during the night following my writing the above (and making the code changes). Got thinking and realized what I had done would affect the sampling. That was because the “World” would be in the dataframe of countries and could be selected as one of the random “countries”. Definitely not what I wanted. So, I have removed the line from the CSV. Now I just add “World” to the list of countries for which I wish to get data. And proceed accordingly. I have updated the code below. And, the sample test run.
You will also likely notice I have already planned for a couple of additional potential experiments.
"""Code for estimating the world's average (median) age by random sampling.
Tied to blog post, world_avg_age_*/*.
This is version 2, needed to rethink how I was doing things due to the time my sample generation
could take. It is based on play/est_avg_age.test.py
Planning to try and do it all in memory rather than repeated i/o.
Will be repeating code in a lot of other modules, but, no way around that while I do my testing
Actually !! going to generate a new csv file that has country and mod continent column from the latest csv file
plus a new valid country name column that has the proper name from the population database csv. Will
use that csv to generate random names for simulation.
File: py_play/population/est_avg_age.py
Module level and/or global variables:
----
_CSV_NM: database population data CSV = 'WPP2019_PopulationByAgeSex_Medium.csv'
_LST_NM: old country/region name list/CSV = 'cr_list.txt'
_LST_CC_NM: new country/region name + locID list/CSV = 'cr_code_list.csv'
_CSV_COUNTRY: downloaded country, continent list/CSV = 'country-continent-codes.csv'
_CSV_MOD_CC: new country, continent list/CSV with valid names from _LST_CC_NM = 'mod_country-continent.csv'
_BLD_DB_CC_CSV: build or rebuild revised country/locID CSV = False
_BLD_MOD_CSV: build or rebuild modified country-continent CSV = False
_DO_SAMPLING: do a sampling run/test = True
_S_SIZE: the number of countries to randomly select for each sample, default 5
_REPEATS: the number of times to repeat our complete process (random countries, calculations, etc) before
we look to see what we have determined -- if anything, default 100
_LABELS: a list of the label for each bin in our country data
_L_LOWS: a lits of the lower boundaries for the above labels (the quartile calculation function needs those)
use_yr: the year for which to estimate average age, default 2011
n_tst: which test to run, default = 1
t_size: actual sample size to use for test, default = _S_Size
t_rpt: actual number of repeitions to use for test, default = _REPEATS
t_stat: if applicable, which sample statistic is to be displayed = 1
Functions:
get_sample(r_nbr): get data for r_nbr countries, generate and return sample statistics
plot_hist(ax, h_data, s_yr, s_type, c_size, s_rep): generate est avg age histogram in specified Axes using supplied data
"""
import csv
import pathlib
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import time
import argparse
# Pylance: disable=reportMissingImports
#import matplotlib.pyplot as plt
#from scipy import stats
from chart import chart
from chart import descriptive_stats as dstat
from database import population as pdb
from database import rc_names as rcn
#import math
# Test for some of the most likely starting working directories.
# If found set DATA_DIR accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
D_PATH = ''
p = pathlib.Path.cwd()
if p.name == 'py_play':
D_PATH = pathlib.Path('./data')
elif p.name == 'population':
D_PATH = pathlib.Path('../data')
elif p.name == 'database':
D_PATH = pathlib.Path('../../data')
else:
print(f"\nCurrent working directory is an unexpected location, {p}!")
print(f"Please run the application from the root directory. Quitting application")
exit(1)
# module constants
_CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
_LST_NM = 'cr_list.txt'
_LST_CC_NM = 'cr_code_list.csv'
_CSV_COUNTRY = 'country-continent-codes.csv'
_CSV_MOD_CC = 'mod_country-continent.csv'
_BLD_DB_CC_CSV = False
_BLD_MOD_CSV = False
_DO_SAMPLING = True
_S_SIZE = 5
_REPEATS = 100
_LABELS = chart.get_agrp_lbls()
_L_LOWS = dstat.get_grp_low_bdrys(_LABELS)
# define some module level, empty/default variables
df_country = pd.DataFrame(columns = ['continent','country','valid'])
valid_country = []
c_data = {}
n_tst = 1
t_size = _S_SIZE
t_rpt = _REPEATS
t_stat = 0
use_yr = '2011'
def get_sample(r_nbr):
global df_country, c_data
s_medians = {}
mean_ms = 0
wt_mean_ms = 0
sum_median = 0
c_pops = []
sum_pop = [0 for _ in range(21)]
c_tot = {}
for _ in range(r_nbr):
rand_r = df_country.sample()
c_nm = rand_r['valid'].iloc[0]
s_data = c_data[c_nm]
c_pops.append(s_data)
c_tot[c_nm] = sum(s_data)
s_median = dstat.get_wt_quartile(_L_LOWS, c_data[c_nm])
s_medians[c_nm] = s_median
mean_ms = sum(s_medians.values()) / r_nbr
wt_medians = [s_medians[nm] * c_tot[nm] for nm in s_medians.keys()]
wt_mean_ms = sum(wt_medians) / sum(c_tot.values())
sum_pop = [sum(n) for n in zip(*c_pops)]
sum_median = dstat.get_wt_quartile(_L_LOWS, sum_pop)
return (mean_ms, wt_mean_ms, sum_median)
def plot_hist(ax, h_data, s_yr, s_type, c_size, s_rep):
ax.set_xlabel(f'Age ({s_type})')
ax.set_title(f"Simulation: {c_size} countries X {s_rep} repetitions")
n, bins, patches = ax.hist(h_data, bins=range(0, 60))
s_mean = statistics.mean(h_data)
s_stdev = statistics.stdev(h_data)
std_err = 1.96 * s_stdev / (s_rep**0.5)
mean_p95 = s_mean + std_err
mean_m95 = s_mean - std_err
ax.axvline(s_mean, 0, 1, color='c', label=f'Est. Average Age: {s_mean:.2f}')
ax.axvline(mean_m95, 0, 1, color='g', label=f'95% conf interval: {mean_m95:.2f}')
ax.axvline(mean_p95, 0, 1, color='m', label=f'95% conf interval: {mean_p95:.2f}')
# not actually plotting anything as such, just want value to show in legend area
ax.axvline(s_stdev, 0, 0, color='k', label=f'std dev: {s_stdev:.2f}')
ax.legend()
if __name__ == '__main__':
tst_desc = ['na',
"Plot Histograms of 3 Types of Sample Data",
"Plot 4 histograms showing sample statistic for various repetitions",
"Plot 4 histograms showing sample statistic for various sample sizes",
# following should always be the last test
"Plot Single Statistic, One Histogram"
]
max_tst = len(tst_desc) - 1
smpl_types =['Simple Mean of Sample Medians',
'Weighted Mean of Sample Medians',
'Median of Summed Sample Data'
]
max_smpl = len(smpl_types) - 1
max_size = 50
max_rpt = 10000
# defaults
# define and get command line arguments
parser = argparse.ArgumentParser()
# long name preceded by --, short by single -, get it as an integer so can use to access test data array
parser.add_argument('--do_exp', '-e', type=int, help=f'Which experiment to run (1-{max_tst})')
parser.add_argument('--c_cnt', '-c', type=int, help=f'Number of countries to use per randome sample (1-{max_size})')
parser.add_argument('--rpts', '-r', type=int, help=f'How many sample repetitions to use for current run (1-{max_rpt})')
parser.add_argument('--stat', '-s', type=int, help=f'Which sample statistic to display (1-{max_smpl}), if applicable')
parser.add_argument('--year', '-y', help=f'Which year to use for sample run (1950-2100)')
args = parser.parse_args()
if args.do_exp:
if (args.do_exp >= 1 and args.do_exp <= max_tst):
n_tst = args.do_exp
else:
_DO_SAMPLING = False
if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
t_size = args.c_cnt
if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
t_rpt = args.rpts
if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
t_stat = args.stat - 1
if args.year and (int(args.year) >= 1950 and int(args.year) <= 2100):
use_yr = args.year
if _BLD_DB_CC_CSV:
s_tm = time.time()
rcn.make_db_name_id_csv()
e_tm = time.time()
print(f"building {_LST_CC_NM} took {e_tm - s_tm}")
if _BLD_MOD_CSV:
s_tm = time.time()
rcn.make_mod_country_csv()
e_tm = time.time()
diff_tm = e_tm - s_tm
print(f"modifying and saving dataframe took {diff_tm:.2f}")
if _DO_SAMPLING:
lst_means = []
lst_wt_means = []
lst_summed_mdns = []
sd_tm = time.time()
df_country = pd.read_csv(D_PATH / _CSV_MOD_CC, comment='#', usecols = ['continent','country','valid'])
valid_c = df_country['valid'].tolist()
valid_c.append("World")
valid_c.sort()
c_data = pdb.get_crs_1yr_all(valid_c, use_yr)
ed_tm = time.time()
diff_tm = ed_tm - sd_tm
print(f"loading CSV and getting data from population database took {diff_tm:.2f} ({sd_tm:.6f}-{ed_tm:.6f})")
# get world median age for current year
w_median = dstat.get_wt_quartile(_L_LOWS, c_data['World'])
for _ in range(t_rpt):
mean_mds, wt_mean_mds, summed_md = get_sample(t_size)
#print(mean_mds, wt_mean_mds, summed_md)
lst_means.append(mean_mds)
lst_wt_means.append(wt_mean_mds)
lst_summed_mdns.append(summed_md)
# simple sample stats
mean_means = statistics.mean(lst_means)
sd_means = statistics.stdev(lst_means)
serr_means = 1.96 * sd_means / (t_rpt**0.5)
means_p95 = mean_means + serr_means
means_m95 = mean_means - serr_means
mean_wtd = statistics.mean(lst_wt_means)
sd_wtd = statistics.stdev(lst_wt_means)
serr_wtd = 1.96 * sd_wtd / (t_rpt**0.5)
wtd_p95 = mean_wtd + serr_wtd
wtd_m95 = mean_wtd - serr_wtd
mean_sumd = statistics.mean(lst_summed_mdns)
sd_sumd = statistics.stdev(lst_summed_mdns)
serr_sumd = 1.96 * sd_sumd / (t_rpt**0.5)
sumd_p95 = mean_sumd + serr_sumd
sumd_m95 = mean_sumd - serr_sumd
print(f"Random sample of {t_size} countries, repeated {t_rpt} times")
print(f"\n{'Simple mean:':14} {mean_means:.2f}, std dev: {sd_means:.2f}, 95%: {means_m95:.2f} - {means_p95:.2f}")
print(f"{'Weighted mean:':14} {mean_wtd:.2f}, std dev: {sd_wtd:.2f}, 95%: {wtd_m95:.2f} - {wtd_p95:.2f}")
print(f"{'Summed median:':14} {mean_sumd:.2f}, std dev: {sd_sumd:.2f}, 95%: {sumd_m95:.2f} - {sumd_p95:.2f}")
es_tm = time.time()
diff_tm = es_tm - ed_tm
print(f"\ngenerating sample stats took {diff_tm:.2f} ({ed_tm:.6f}-{es_tm:.6f})")
h_lists = [lst_means, lst_wt_means, lst_summed_mdns]
if n_tst == 1:
print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")
fig, axs = plt.subplots(2, 2, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
axs[1, 1].axis('off')
fig.suptitle(f"Estimated World Average Age: {use_yr}")
p_row = 0
for i in range(len(h_lists)):
plot_hist(axs[p_row, i % 2], h_lists[i], use_yr, smpl_types[i], t_size, t_rpt)
if i > 0 and i % 2 == 1:
p_row += 1
elif n_tst == 2:
print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")
elif n_tst == 3:
print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")
elif n_tst == max_tst:
print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")
fig, ax = plt.subplots(figsize=(10,6))
fig.suptitle(f"Estimated World Average Age: {use_yr}")
plot_hist(ax, h_lists[t_stat], use_yr, smpl_types[t_stat], t_size, t_rpt)
es_tm = time.time()
diff_tm = es_tm - sd_tm
print(f"whole sampling process took {diff_tm:.2f}")
plt.show()
And, here’s a test run using all the command line parameters.
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 4 -s 2 -c 10 -r 500 -y 2020
loading CSV and getting data from population database took 4.92 (1605536641.538051-1605536646.457991)
Random sample of 10 countries, repeated 500 times
Simple mean: 29.95, std dev: 3.08, 95%: 29.68 - 30.22
Weighted mean: 30.52, std dev: 4.93, 95%: 30.09 - 30.96
Summed median: 29.94, std dev: 5.04, 95%: 29.49 - 30.38
World median: 30.90
generating sample stats took 1.18 (1605536646.457991-1605536647.639950)
Test #4: Plot Single Statistic, One Histogram
whole sampling process took 6.73