Average World Age

Well, last time I took us on something akin to a merry-go-round. Start, test, rethink, try something else, test, rethink… In the process we created a couple of new CSV files to hopefully help achieve our goal.

Set Up the Data In-memory

So what I am going to do now is read our modified country-continent CSV into a dataframe. Get a list of the ‘valid’ country names. And use that list to get a dictionary of the population data for each of those countries for our specific year (currently 2011). And, see if I run into any problems. Particularly any memory issues. I will print the data for a few countries, selected at random, to make sure all is good.

My first attempt did not go so well.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
list of valid country names has 199 entries.
c_data has 45 entries
loading CSV and getting data from population database took 5.00
Traceback (most recent call last):
  File "R:\learn\py_play\population\play\get_avg_age.test.py", line 239, in <module>
    print(f"{c_nm} {c_data[c_nm]}")
KeyError: 'Micronesia (Fed. States of)'

Note that c_data has 45 entries and our country list has 199 entries. They both should have the same number of entries. For a moment I thought I’d run out of memory. But, that would likely have generated an error message. So, I did some troubleshooting. After a bunch of debug print statements in a number of modules, I discovered that the code in database.population.get_crs_1yr_all() was not finding Curaçao in the population data CSV. Even though we know it has to be there. So, I commented out that row in the CSV file and tried again. Not much improvement.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
list of valid country names has 198 entries.
c_data has 47 entries
loading CSV and getting data from population database took 5.03
Traceback (most recent call last):
  File "R:\learn\py_play\population\play\get_avg_age.test.py", line 239, in <module>
    print(f"{c_nm} {c_data[c_nm]}")
KeyError: 'Guyana'

After some more hunting, I commented out the row containing Côte d’Ivoire and tried again. Another improvement, but not there yet. Some more hunting, commented out the row containing Réunion. Then…

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
list of valid country names has 196 entries.
c_data has 196 entries
loading CSV and getting data from population database took 4.88
Sweden [562.002, 522.975, 478.306, 632.243, 646.133, 590.521, 576.566, 626.032, 670.657, 636.591, 592.705, 561.881, 618.621, 546.018, 398.516, 304.882, 243.128, 168.743, 72.327, 16.251, 1.607]
New Caledonia [22.323, 20.85, 22.726, 19.463, 19.793, 20.31, 19.416, 21.032, 18.95, 17.605, 14.286, 11.401, 10.011, 7.648, 5.195, 3.116, 1.817, 0.779, 0.293, 0.049, 0.003]
Afghanistan [5365.812, 4833.884, 4166.801, 3280.454, 2576.976, 2152.51, 1772.48, 1417.582, 1144.33, 921.964, 733.987, 578.674, 462.821, 317.15, 201.68, 113.832, 53.44, 18.616, 3.947, 0.448, 0.023]
Sri Lanka [1775.91, 1743.743, 1631.746, 1615.903, 1565.987, 1597.359, 1622.33, 1431.757, 1350.933, 1293.717, 1211.632, 1038.353, 931.814, 647.886, 380.02, 261.751, 179.491, 86.779, 26.546, 4.442, 0.397]
Mali [3021.222, 2430.086, 1952.173, 1614.901, 1343.423, 1139.322, 951.953, 766.145, 579.544, 432.551, 344.889, 295.858, 222.178, 180.71, 121.254, 71.669, 34.052, 10.69, 1.835, 0.135, 0.003]

You will notice all three names have in common the use of accented characters. I spent some considerable time searching the web for a solution, but have not yet found one that works. So, for now, I am just going to leave the three countries commented out. I don’t expect that will have a serious impact on our attempt to work out the world’s average age for 2011.

I was doing some other testing, days later, when something stumbled into my mind. Turns out I wasn’t specifying the file encoding when opening the CSV files with the plain Python file open() function. So, I added encoding='utf-8' to the open() function’s parameters in the relevant database/population.py functions and voilà. I also did the same in database/rc_names.py. I had understood Python defaulted to UTF-8, but it turns out the built-in open() defaults to the locale’s preferred encoding (typically cp1252 on Windows), not UTF-8, which is why specifying the encoding was necessary.
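For reference, the change amounts to something like the sketch below. To be clear, this is not the actual code from database/population.py, just an illustration of the parameter in question (the file name is made up for the example).

from pathlib import Path
import csv

csv_path = Path('population_data.csv')  # hypothetical file name for this example

# Being explicit about UTF-8 means accented country names (Curaçao, Côte d'Ivoire,
# Réunion) are decoded correctly regardless of the platform's default encoding.
with open(csv_path, newline='', encoding='utf-8') as fl:
  for row in csv.reader(fl):
    print(row[0])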

I also discovered that there was data for the Channel Islands in the population database CSV, but no such country in the country-continent CSV file. And, the country number for Sudan did not match in the two files. So, I modified make_mod_country_csv() to add the two countries to the dataframe before writing it to the CSV file. I then ran the two functions to create updated versions of the two CSV files and tested that everything now worked.

One of my tests was to add up all the population values, by age group for the year 2011, for the countries in the CSV file and compare that to the values for World. The sum of the countries for each age group is less than that of the entry for World. Not sure what impact that has on the attempt to use sampling to estimate the world’s median age for any given year. But, I am happier with how things are working.

And, my code so far.


...

# in the defaults near the top of the module
_DO_SAMPLING = True

...

# in the if __name__ == '__main__': block
  if _DO_SAMPLING:
    sd_tm = time.time()
    
    df_country = pd.read_csv(D_PATH / _CSV_MOD_CC, comment='#', usecols=['continent', 'country', 'valid'])

    valid_c = df_country['valid'].tolist()
    valid_c.sort()
    print(f"list of valid country names has {len(valid_c)} entries.")

    c_data = pdb.get_crs_1yr_all(valid_c, use_yr)
    print(f"c_data has {len(c_data)} entries")

    ed_tm = time.time()
    diff_tm = ed_tm - sd_tm
    print(f"loading CSV and getting data from population database took {diff_tm:.2f}")

    # simple test
    for _ in range(5):
      rand_r = df_country.sample()
      c_nm = rand_r['valid'].iloc[0]
      print(f"{c_nm} {c_data[c_nm]}")

Calculate Sample Statistics for a Single Sample of 5 Countries

Now, for a random selection of 5 countries taken from the dataframe, let’s generate the median age for each, the mean of the medians, etc. And, print out the results. Slow steps. Because I plan to repeat this process numerous times, and possibly with varying numbers of randomly selected countries, I am going to put the whole process into a function: get_sample(r_nbr=_S_SIZE). _S_SIZE is a default value specified with all the other defaults early in the module; for now it is set to 5.

I plan to have the country list dataframe and the dictionary of data loaded before the function is ever called. So, for now, I am going to make those two variables globally accessible in the function. I do plan to calculate a couple of different values for plotting, out of curiosity. But, for now, I will just generate the mean of the _S_SIZE values to return from the function. And, a simple test with some extra print statements produced the following.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
list of valid country names has 196 entries.
c_data has 196 entries
loading CSV and getting data from population database took 4.90
{'China, Macao SAR': 36.525987828343844, 'Solomon Islands': 19.520151155950867, 'Georgia': 37.12273896129585, 'Comoros': 19.23738841193252, 'Madagascar': 18.12739031739944}
mean of the sample medians is 26.106731334984506

Couple of repeats got me similar results. Here’s my function and the reworked simple test.

def get_sample(r_nbr=_S_SIZE):
  global df_country, c_data
  s_medians = {}
  mean_ms = 0

  for _ in range(r_nbr):
    rand_r = df_country.sample()
    c_nm = rand_r['valid'].iloc[0]
    s_data = c_data[c_nm]
    s_median = dstat.get_wt_quartile(_L_LOWS, s_data)
    s_medians[c_nm] = s_median

  # during dev
  print(s_medians)

  mean_ms = sum(s_medians.values()) / r_nbr

  return mean_ms

# and back in the if __name__ == '__main__': block
    # simple test
    mean_mds = get_sample()
    print(f"mean of the sample medians is {mean_mds}")

Extra Statistics

Because I am somewhat uncertain that the simple mean of the medians is the best/proper sample measure, I am also going to generate a couple of other values. One is the weighted mean of the medians — weighted by total country population. And, the median of the sum of the population values for the sample’s countries. That is the sum by age group of all the countries in the sample. Really not sure which is the best value. Though I would think the weighted mean should approach the median of the summed populations for a large enough sample size. But haven’t confirmed that guess just yet.

I will modify the function get_sample() accordingly. Looks a bit messy but…

def get_sample(r_nbr=_S_SIZE):
  global df_country, c_data
  s_medians = {}    # median age by country
  mean_ms = 0       # simple mean of the sample medians
  wt_mean_ms = 0    # mean of the medians weighted by total country population
  sum_median = 0    # median of the sample's summed age-group populations
  c_pops = []       # age-group population lists for each sampled country
  c_tot = {}        # total population by country

  for _ in range(r_nbr):
    rand_r = df_country.sample()
    c_nm = rand_r['valid'].iloc[0]
    #print(f"{c_nm} is {'in' if c_nm in c_data.keys() else 'not in'} c_data")
    s_data = c_data[c_nm]
    c_pops.append(s_data)
    c_tot[c_nm] = sum(s_data)
    s_medians[c_nm] = dstat.get_wt_quartile(_L_LOWS, s_data)

  mean_ms = sum(s_medians.values()) / r_nbr
  wt_medians = [s_medians[nm] * c_tot[nm] for nm in s_medians]
  wt_mean_ms = sum(wt_medians) / sum(c_tot.values())
  # sum the sampled countries' populations by age group, then take that sum's median
  sum_pop = [sum(n) for n in zip(*c_pops)]
  sum_median = dstat.get_wt_quartile(_L_LOWS, sum_pop)

  return (mean_ms, wt_mean_ms, sum_median)

And, the simple test code now looks like this:

    # test function
    mean_mds, wt_mean_mds, summed_md = get_sample()
    print(f"simple mean: {mean_mds}, weighted mean: {wt_mean_mds}, summed median: {summed_md}")

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
list of valid country names has 196 entries.
c_data has 196 entries
loading CSV and getting data from population database took 4.91 (1605221190.002691-1605221194.911092)
simple mean: 31.547336834512812, weighted mean: 31.421215496274694, summed median: 30.736214161592795

Okay that appears to work.

And, as I increase the number of countries in the sample, the weighted mean and the summed median do appear to get closer to each other. Overly simple test, but it may be pointing us in the right direction.

# 8 countries
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
loading CSV and getting data from population database took 4.91 (1605221190.002691-1605221194.911092)
simple mean: 31.547336834512812, weighted mean: 31.421215496274694, summed median: 30.736214161592795

# 30 countries
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
loading CSV and getting data from population database took 4.79 (1605221353.560968-1605221358.350966)
simple mean: 27.96474855174457, weighted mean: 33.407642898889165, summed median: 32.99758724100588

# 100 countries
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
loading CSV and getting data from population database took 4.78 (1605221399.632377-1605221404.407910)
simple mean: 23.218238263739963, weighted mean: 28.25301587208027, summed median: 28.32396935375201

Repeated Samples

Now, let’s get a whole bunch of samples and look at the mean and standard deviation of each of the three values we are generating for each sample. I will start with a sample of 5 countries over 100 repetitions. That means a new default variable to go along with _S_SIZE:

_S_SIZE = 5
_REPEATS = 100

And, I get the following from a couple of executions. Bit of variation, as is to be expected.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
loading CSV and getting data from population database took 4.68 (1605223546.286841-1605223550.967052)
Random sample of 5 countries, repeated 100 times

Simple mean:   27.53, std dev: 3.29, 95%: 26.89 - 28.18
Weighted mean: 27.70, std dev: 5.43, 95%: 26.64 - 28.77
Summed median: 27.19, std dev: 5.42, 95%: 26.13 - 28.25

generating sample stats took 0.13 (1605223550.967052-1605223551.098606)

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\get_avg_age.test.py
loading CSV and getting data from population database took 4.90 (1605225093.328681-1605225098.230928)
Random sample of 5 countries, repeated 100 times

Simple mean:   28.74, std dev: 4.02, 95%: 27.96 - 29.53
Weighted mean: 28.43, std dev: 6.43, 95%: 27.17 - 29.69
Summed median: 27.90, std dev: 6.56, 95%: 26.62 - 29.19

generating sample stats took 0.12 (1605225098.230928-1605225098.348762)

And, my code is as follows. Do have a look at Estimating a Population Mean to see where I got introduced to the concepts of standard error and confidence interval.

    # collect the three statistics for each of the _REPEATS samples
    lst_means = []
    lst_wt_means = []
    lst_summed_mdns = []

    for _ in range(_REPEATS):
      mean_mds, wt_mean_mds, summed_md = get_sample()
      lst_means.append(mean_mds)
      lst_wt_means.append(wt_mean_mds)
      lst_summed_mdns.append(summed_md)

    # simple sample stats: mean, std dev, and the 95% confidence interval
    # (1.96 standard errors either side of the mean)
    mean_means = statistics.mean(lst_means)
    sd_means = statistics.stdev(lst_means)
    serr_means = 1.96 * sd_means / (_REPEATS**0.5)
    means_p95 = mean_means + serr_means
    means_m95 = mean_means - serr_means

    mean_wtd = statistics.mean(lst_wt_means)
    sd_wtd = statistics.stdev(lst_wt_means)
    serr_wtd = 1.96 * sd_wtd / (_REPEATS**0.5)
    wtd_p95 = mean_wtd + serr_wtd
    wtd_m95 = mean_wtd - serr_wtd

    mean_sumd = statistics.mean(lst_summed_mdns)
    sd_sumd = statistics.stdev(lst_summed_mdns)
    serr_sumd = 1.96 * sd_sumd / (_REPEATS**0.5)
    sumd_p95 = mean_sumd + serr_sumd
    sumd_m95 = mean_sumd - serr_sumd

    print(f"Random sample of {_S_SIZE} countries, repeated {_REPEATS} times")
    print(f"\n{'Simple mean:':14} {mean_means:.2f}, std dev: {sd_means:.2f}, 95%: {means_m95:.2f} - {means_p95:.2f}")
    print(f"{'Weighted mean:':14} {mean_wtd:.2f}, std dev: {sd_wtd:.2f}, 95%: {wtd_m95:.2f} - {wtd_p95:.2f}")
    print(f"{'Summed median:':14} {mean_sumd:.2f}, std dev: {sd_sumd:.2f}, 95%: {sumd_m95:.2f} - {sumd_p95:.2f}")

    es_tm = time.time()
    diff_tm = es_tm - ed_tm
    print(f"\ngenerating sample stats took {diff_tm:.2f} ({ed_tm:.6f}-{es_tm:.6f})")

Before looking at using bigger numbers, let’s look at plotting our sample results.

Plotting Simulation Results

I have been calling our whole set of repeated samples a sample, which might get confusing. So, for now, I will refer to it as a simulation.

Some subplot() Basics

Our simulation generates three values. So, rather than plotting three separate histograms in three separate figures, I am going to plot the three histograms in a single figure in a 2x2 grid. This is going to require a change in our call to plt.subplots(). And a modification to how we call the code that generates the elements for each plot (grid item).

The call will look similar to our previous efforts, but there will be one big difference in the return values. Something like: fig, axs = plt.subplots(2, 2, figsize=(15,8)). fig will still be a reference to our figure object. But, axs will be a 2D numpy array of Axes objects. We will need to access each Axes individually to plot our histograms. Something like:

             |             

  axs[0, 0]  |  axs[0, 1]  

             |             
- - - - - - - - - - - - - -
             |             

  axs[1, 0]  |  axs[1, 1]  

             |             

I will as usual start slow and work my way up. So, I will generate the figure with a 2x2 grid and generate my three histograms without too much thought about how they look. Then I will do what I can to tidy things up. Then finally I will add the mean and 95% confidence intervals to the three histograms. But as I said, slowly and little by little.

Oh, yes, I would like to use a loop to plot the three histograms, so I will likely need to add a few additional variables. Give it a shot and see you back here in a bit. Assuming I am going to loop over range(nbr_plots), I can get the correct column value for the axs array by using the modulus operator. A favourite for programmers in this sort of situation. E.g.:

for i in range(3):
  a_col = i % 2
  print(f"{a_col} ", end=''

# output looks like
0 1 0 

The row index will take a bit more work. We will initialize it before the loop. Then, at the end of the loop, if i > 0, we will check whether i % 2 == 1 (that is, we have just plotted in the last Axes on that row), and if so increment the row index for the next run through the loop. Something like:

p_row = 0
for i in range(3):
  print(f"{i}: do something with axs[{p_row}, {i % 2}]")
  if i > 0 and i % 2 == 1:
    p_row += 1

# output looks like
0: do something with axs[0, 0]
1: do something with axs[0, 1]
2: do something with axs[1, 0]
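
As an aside, divmod() would give us both indices in one go, since divmod(i, 2) returns (i // 2, i % 2). Not what I am using above, just noting the alternative:

for i in range(3):
  p_row, a_col = divmod(i, 2)
  print(f"{i}: do something with axs[{p_row}, {a_col}]")

# output is the same as above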

And, a little test:

    fig, axs = plt.subplots(2, 2, figsize=(15,8))
    p_row = 0
    for i in range(3):
      axs[p_row, i % 2].text(.4, .5, f"axs[{p_row}, {i % 2}]", fontsize=24)
      if i > 0 and i % 2 == 1:
        p_row += 1

    plt.show()

[Image: result of the above test accessing multiple Axes in a figure]

As I will be generating a histogram multiple times, I am going to put that code into a separate function, plot_hist(). One of its parameters will be the Axes on which it should be plotted. I will also pass in the info needed to produce the histogram, as well as any info needed to generate a title, the axis labels, etc.

I also want the axis information to be shared between plots, so I will be using sharex=True, sharey=True in my call to plt.subplots(). And I am going to specify an hspace value as well (a result of trial and error as I wrote the code). And, because we won’t be plotting to axs[1, 1], I am going to try hiding it. So, my subplot generation will look something like:

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
axs[1, 1].axis('off')

And Our Three Histograms

As mentioned, I will be using a function to draw the histograms. So, let’s start with that. I will need to pass the Axes on which it will operate. I need to pass the data. And, I need some info for the title and x-axis labels. I will be adding a list with information about what is being plotted to cover the latter. And passing in the year for use in the title.

def plot_hist(ax, s_yr, s_type, h_data):
  ax.set_xlabel(f'Average Age ({s_type})')
  ax.set_title(f"Estimated World Average Age: {s_yr} ({_S_SIZE} X {_REPEATS})")
  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

I am using a smaller range for the histograms, since the average age is unlikely to be much over 45 or so. If I left it at 0 - 110, the histograms would get a little bunched up in the middle of the chart.

Because I will be using a loop, I am going to put my three data lists into another list. The position in that list needs to match the position of the corresponding information in the x-axis label list. This does not create a duplicate of the data. Instead, Python puts references to the original lists in the new one. So, no significant additional memory use.
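A quick way to convince yourself of that:

lst_a = [1, 2, 3]
lst_of_lists = [lst_a]
print(lst_of_lists[0] is lst_a)   # True: the same list object, not a copy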

...
# run code to get data (see above)
...
smpl_types = ['Simple Mean of Sample Medians', 'Weighted Mean of Sample Medians', 'Median of Summed Sample Data']
h_lists = [lst_means, lst_wt_means, lst_summed_mdns]

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
axs[1, 1].axis('off')

p_row = 0
for i in range(len(h_lists)):
  plot_hist(axs[p_row, i % 2], use_yr, smpl_types[i], h_lists[i])
  if i > 0 and i % 2 == 1:
    p_row += 1

plt.show()

And, bing, bang, bam:

[Image: figure with the 3 sets of simulation data plotted as histograms]

Which, unfortunately, left me without any x-axis tick labels for axs[0, 1]: with sharex=True, only the bottom row of plots gets tick labels, and the Axes below that one is hidden. So, I had to be more specific. I could still share the y-axis, but had to handle the x-axis visibility myself.

fig, axs = plt.subplots(2, 2, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
axs[1, 1].axis('off')
plt.setp(axs[0, 0].get_xticklabels(), visible=False)

Well, it would be nice to have a title for the complete figure. And, I’d like to have the mean and confidence intervals shown on each plot, along with a suitable legend. So, let’s add some more code to the histogram function. I looked at passing in the extra data, but in the end decided to calculate it in the function before plotting. And, above our plotting loop, we’ll add a title to the figure. Since there’s no sense having the same title information for the figure and the plots, I changed the plot title to better indicate the number of countries and repetitions used for the plot.

def plot_hist(ax, s_yr, s_type, h_data):
  ax.set_xlabel(f'Age ({s_type})')
  #ax.set_title(f"Estimated World Average Age: {s_yr} ({_S_SIZE} X {_REPEATS})")
  ax.set_title(f"Simulation: {_S_SIZE} countries X {_REPEATS} repetitions")
  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

  s_mean = statistics.mean(h_data)
  s_stdev = statistics.stdev(h_data)
  std_err = 1.96 * s_stdev / (_REPEATS**0.5)
  mean_p95 = s_mean + std_err
  mean_m95 = s_mean - std_err

  ax.axvline(s_mean, 0, 1, color='c', label=f'Est. Average Age: {s_mean:.2f}')
  ax.axvline(mean_m95, 0, 1, color='g', label=f'95% conf interval: {mean_m95:.2f}')
  ax.axvline(mean_p95, 0, 1, color='m', label=f'95% conf interval: {mean_p95:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  ax.axvline(s_stdev, 0, 0, color='k', label=f'std dev: {s_stdev:.2f}')
  ax.legend()

# and down in my plotting code

smpl_types = ['Simple Mean of Sample Medians', 'Weighted Mean of Sample Medians', 'Median of Summed Sample Data']
h_lists = [lst_means, lst_wt_means, lst_summed_mdns]

fig, axs = plt.subplots(2, 2, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
axs[1, 1].axis('off')
fig.suptitle(f"Estimated World Average Age: {use_yr}")

p_row = 0
for i in range(len(h_lists)):
  plot_hist(axs[p_row, i % 2], use_yr, smpl_types[i], h_lists[i])
  if i > 0 and i % 2 == 1:
    p_row += 1

plt.show()

And I get something like:

[Image: figure with the updated titles and axis information]

Done, Once Again…

Well, once again, I have decided to end a post long before I expected to. But, you know, I think we’ve covered a decent amount of information and have things in a good position to do some further experimenting with our simulation model next time.

But, before I call it a day, I am going to add some command line arguments to the module. I want to be able to control what it does in future: change the year, the number of countries per sample, or the total number of repetitions. I may alter some of the variable names, etc. as I go along. I will also put the current plotting code into a suitable if block, as I expect to try a few different styles of plotting for various simulation parameters.
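The flag names below are placeholders; I will settle on the final ones as I write the code. But the skeleton will be something along these lines:

import argparse

# Placeholder flag names; the module may well end up using different ones.
parser = argparse.ArgumentParser(description='Estimate the world average age by sampling countries')
parser.add_argument('-y', '--year', type=int, default=2011, help='year to estimate')
parser.add_argument('-c', '--countries', type=int, default=5, help='number of countries per sample')
parser.add_argument('-r', '--repeats', type=int, default=100, help='number of samples in the simulation')
parser.add_argument('-p', '--plot', action='store_true', help='plot the simulation results')
args = parser.parse_args()

# the parsed values would then replace the current module defaults
use_yr, _S_SIZE, _REPEATS = args.year, args.countries, args.repeats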

Once done, I will add a link to view my module code as it currently stands. And, I have decided to move it out of test status so that I can track code changes (Git). Code will be in tracked module population/est_avg_age.py. Will look after all of that before I start on the draft of the next post.


  • Code for est_avg_age.py as of the writing of this draft. It has since changed somewhat. But I think the code referenced here is compatible with the above discussion.

Resources