Well, may be this time?

Average World Age — A Review

First a bit of a review. The goal was to estimate the world’s average age for a given year by sampling. That is by getting the average age for a number of countries and calculating the average of those country’s average age. For a single countries average age, I am using the median of the binned data we have available in our data population database (a CSV file). For the average of the countries in the sample I am going to use the weighted mean for all the countries in the sample. For comparison, I will also calcuate and track the median for the sum of the population for all the countries in the sample. That sum is calculated by age group, so we can use our function to get the binned median.

We have so far created some additional CSV files to help us select valid country names at random. We’ve written a function to get a sample and return the values of interest for a specified sample size. To speed things up, we are getting all the necessary information one time and storing it in memory variables. The sampling function uses those memory variables to generate its results.

The plan is to repeat the sampling multiple times, and calculate the mean of all the samples to estimate the world’s (the population for our statistic) average age. We also calculate the standard error and 95% confidence interval(s).

In addition, We wrote a function to plot a histogram of the repeated sample data, along with the mean and confidence interval.

Finally, we wrote some experiments to test our basic code. We are now ready to write some more experiments to test our estimating methodology. And, the impact various parameters (e,g, sample size, number of repeitions) have on the outcome.

Some Notes Related to Experiments

I have decided to use the “weighted mean” as my primary statistic for estimating the average world age. Using means generally results in a normal distribution. And, the weighted mean takes account of the individual contribution of each country with respect to the world’s total population.

I plan to focus on — investigate? — what happens in two cases. Firstly what happens when we alter the number of repetitions for a fixed number of countries per sample. Secondly, what happens when for a fixed number of repetitions we alter the number of countries per sample. I will be putting together new experiments for these two cases.

est_avg_age.py

You will recall I was going to refactor my test code and put it into a new module which I would track (Git). I have made the code for the module available in an unlisted post. I will continue by adding the new experiments to this module. Refactoring existing code as required.

First Let’s Look at a Single Sample of Varying Sizes

For now, I will continue to look at three measures of the average age: ‘simple mean of all sample averages’, ‘weighted mean of all sample averages’, and the ‘summed population median for all sample population data’. Mainly because that’s what the current code for our base experiment will generate. I may or may not modify the code at a later time. Let’s look at the results for single sample sizes of 20, 30, 50 and 100.

(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -c 20 -r 1
loading CSV and getting data from population database took 4.49 (1606239075.362762-1606239079.856127)
Traceback (most recent call last):
  File "R:\learn\py_play\population\est_avg_age.py", line 267, in <module>
    sd_means = statistics.stdev(lst_means)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\statistics.py", line 797, in stdev
    var = variance(data, xbar)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\statistics.py", line 739, in variance
    raise StatisticsError('variance requires at least two data points')
statistics.StatisticsError: variance requires at least two data points

Well, that didn’t go so well. Seems I always manage to break something.

I do want to look at the results of single samples. If nothing else it should provide some sort of benchmark for later experiments. So, a code refactoring or a new test file is in order. I think I will try the refactoring approach. Guess this post may not, once agtain, get as far as I hoped.

I will start by modifying the code that generates the basic statistics once all the data is loaded into memory and a sample is or multiple samples are obtained. Then I will remove the calculations for the mean, standard deviation, standard error, etc. from the histogram generation function. Modifying the parameters to include whatever is needed to generate the added lines on the histogram.

Then test to make sure everything works for a single sample and for repeated sampling. I will, somewhere along the line, at an appropriate time add a new experiment/test for the single sample case, generating the desired output (terminal and chart — though likely incrementally). Expect this will take a bit of time, and numerous commits.

Generating Basic Statsistics for Both Cases

And almost immediately a small problem. To do what I would like for the single sample case, I need the actual country data, median and total population, produced by the sampling process. Our current function, get_sample(r_nbr), doesn’t return that information. It uses it to generate the 3 data types (statistics?) which it returns. Thought about modifying get_sample(), but decided that wasn’t the best approach. Will also make testing simpler as I won’t be messing, yet, with existing functions.

But, experiment/test code will likely get a bit more messy. Though, once again, new functions might help with that. We shall see.

So going to write a new function, get_single_sample(r_nbr), which will get samples for r_nbr countries and return the necessary data. I can then do what I want with that data. I have decided to return, so far, two dictionaries, both keyed on country name. One for the median age and one for the total population. Thought about returning one dictionary with lists or tuples for the values, but figured that employing the data might get messy.

I may write another function to do the work of converting the raw data, returning only the values needed to plot the histogram(s).

Turns out I did the calculating in the main code, with an if statement to check if doing single sample or repeating samples. Do what’s appropriate in the if and else blocks. Subsequent code uses the values generated here. So, added a bunch of variable declarations at the top of the if _DO_SAMPLING: block (don’t want pylint or pylance complaining).

Then put the code to display the numbers in the terminal into two separate functions, one for the single sample case (display_stats_1_sample()) and one for the repetitive sampling case (display_stats_w_rpts()). Calling them as appropriate in the working code block. With respect to calculating the standard deviation for the weighted mean, I refer you to the stackoverflow item in the Resources section at the bottom of the post. I have not in anyway attempted to understand how it works. I was going to use the statsmodels.stats.weightstats.DescrStatsW class, but didn’t feel like installing a new package. Though I may do so, just to compare the result with the calculation I did use (from near the bottom of that stackoverflow discussion).

Also since in this case the “summed median” value is effectively a single sample, I used the standard deviation and standard error for the weighted mean sample. Just so I had some values to use for the output and eventually for the plot.

Oh, yes. Also added the basics for a new test. Note change to the dictionary tst_desc in the output of the git diff below. And, the addition of a new elif n_tst == 5:.

Here’s my additions and changes.

# with the other imports
import numpy as np

...

# new function definitions

def get_single_sample(r_nbr):
  global df_country, c_data
  c_medians = {}
  c_totals = {}
  wt_mean = 0
  sum_median = 0
  c_pops = []
  sum_pop = [0 for _ in range(21)]
  
  for _ in range(r_nbr):
    rand_r = df_country.sample()
    c_nm = rand_r['valid'].iloc[0]
    s_data = c_data[c_nm]
    c_pops.append(s_data)
    c_totals[c_nm] = sum(s_data)
    c_median = dstat.get_wt_quartile(_L_LOWS, s_data)
    c_medians[c_nm] = c_median

  wt_medians = [c_medians[nm] * c_totals[nm] for nm in c_medians.keys()]
  wt_mean = sum(wt_medians) / sum(c_totals.values())
  sum_pop = [sum(n) for n in zip(*c_pops)]
  sum_median = dstat.get_wt_quartile(_L_LOWS, sum_pop)

  return c_medians, c_totals, wt_mean, sum_median


def display_stats_w_rpts():
  print(f"Random sample of {t_size} countries, repeated {t_rpt} times")

  print(f"\n{'':14} | {'Avg':>5} | {'std dev'} | {'norm 95% conf':13} | {'t 95% conf':13}")
  print(f"{'':14} | {'':5} | {'':7} | n={t_rpt:<11} | n={t_size:<11}")
  print(f"\n{'Simple mean:':14} | {mean_means:.2f} | {sd_means:7.2f} | {means_m95:.2f} - {means_p95:.2f} | {means_t_m95:.2f} - {means_t_p95:.2f}")
  print(f"{'Weighted mean:':14} | {mean_wtd:.2f} | {sd_wtd:7.2f} | {wtd_m95:.2f} - {wtd_p95:.2f} | {wtd_t_m95:.2f} - {wtd_t_p95:.2f}")
  print(f"{'Summed median:':14} | {mean_sumd:.2f} | {sd_sumd:7.2f} | {sumd_m95:.2f} - {sumd_p95:.2f} | {sumd_t_m95:.2f} - {sumd_t_p95:.2f}")
  print(f"{'World median:':14} | {w_median:.2f}")


def display_stats_1_sample():
  print(f"Random sample of {t_size} countries, one time")

  print(f"\n{'':14} | {'Avg':>5} | {'std dev'} | {'std err':7} | {'t 95% conf':13}")
  print(f"{'':14} | {'':5} | {'':7} | {'':7} | n={t_size:<11}")
  print(f"\n{'Simple mean:':14} | {mean_means:.2f} | {sd_means:7.2f} | {terr_means:7.2f} | {means_t_m95:.2f} - {means_t_p95:.2f}")
  print(f"{'Weighted mean:':14} | {mean_wtd:.2f} | {sd_wtd:7.2f} | {terr_wtd:7.2f} | {wtd_t_m95:.2f} - {wtd_t_p95:.2f}")
  print(f"{'Summed median:':14} | {mean_sumd:.2f} | {sd_sumd:7.2f} | {terr_sumd:7.2f} | {sumd_t_m95:.2f} - {sumd_t_p95:.2f}")
  print(f"{'World median:':14} | {w_median:.2f}")

And, here’s a git diff for the main experiment/test coding area.

 if __name__ == '__main__':
   tst_desc = ['na',
-              "Plot Histograms of 3 Types of Sample Data",
+              "Plot Histograms of 3 Types of Repeated Sample Data",
               "Plot 4 histograms showing sample statistic for various repetitions",
               "Plot 4 histograms showing sample statistic for various sample sizes",
               "Confidence Interval Variations",
+              "Generate and plot a single sample of the specified size",
               # following should always be the last test
               "Plot Single Statistic, One Histogram"
   ]
@@ -236,7 +286,14 @@ if __name__ == '__main__':
     lst_means = []
     lst_wt_means = []
     lst_summed_mdns = []
-
+    s_medians = {}
+    s_totals = {}
+    sumd_mds = 0
+    mean_means, sd_means, serr_means, terr_means = (0, 0, 0, 0)
+    mean_wtd, sd_wtd, serr_wtd, terr_wtd = (0, 0, 0, 0)
+    mean_sumd, sd_sumd, serr_sumd, terr_sumd = (0, 0, 0, 0)
+    h_lists = [[], [], []]
+
     sd_tm = time.time()

     df_country = pd.read_csv(D_PATH / _CSV_MOD_CC, comment='#', usecols = ['continent','country','valid'])
@@ -255,64 +312,75 @@ if __name__ == '__main__':
     w_median = dstat.get_wt_quartile(_L_LOWS, c_data['World'])
     #w_median = dstat.get_wt_mean(_L_MIDS, c_data['World'])

-    for _ in range(t_rpt):
-      mean_mds, wt_mean_mds, summed_md = get_sample(t_size)
-      #print(mean_mds, wt_mean_mds, summed_md)
-      lst_means.append(mean_mds)
-      lst_wt_means.append(wt_mean_mds)
-      lst_summed_mdns.append(summed_md)
-
-    # simple sample stats
-    mean_means = statistics.mean(lst_means)
-    sd_means = statistics.stdev(lst_means)
-    serr_means = 1.96 * sd_means / (t_rpt**0.5)
+    if t_rpt == 1:
+      s_medians, s_totals, mean_wtd, mean_sumd = get_single_sample(t_size)
+
+      mean_means = statistics.mean(s_medians.values())
+      sd_means = statistics.stdev(s_medians.values())
+      serr_means = 0
+      terr_means = t_conf_int(t_size, sd_means)
+
+      sd_wtd = np.sqrt(np.cov(list(s_medians.values()), aweights=list(s_totals.values())))
+      serr_wtd = 0
+      terr_wtd = t_conf_int(t_size, sd_wtd)
+
+      sd_sumd = sd_wtd
+      serr_sumd = 0
+      terr_sumd = t_conf_int(t_size, sd_sumd)
+    else:
+      for _ in range(t_rpt):
+        mean_mds, wt_mean_mds, summed_md = get_sample(t_size)
+        #print(mean_mds, wt_mean_mds, summed_md)
+        lst_means.append(mean_mds)
+        lst_wt_means.append(wt_mean_mds)
+        lst_summed_mdns.append(summed_md)
+
+        mean_means = statistics.mean(lst_means)
+        sd_means = statistics.stdev(lst_means)
+        serr_means = 1.96 * sd_means / (t_rpt**0.5)
+        terr_means = t_conf_int(t_size, sd_means)
+
+        mean_wtd = statistics.mean(lst_wt_means)
+        sd_wtd = statistics.stdev(lst_wt_means)
+        serr_wtd = 1.96 * sd_wtd / (t_rpt**0.5)
+        terr_wtd = t_conf_int(t_size, sd_wtd)
+
+        mean_sumd = statistics.mean(lst_summed_mdns)
+        sd_sumd = statistics.stdev(lst_summed_mdns)
+        serr_sumd = 1.96 * sd_sumd / (t_rpt**0.5)
+        terr_sumd = t_conf_int(t_size, sd_sumd)
+
+        h_lists = [lst_means, lst_wt_means, lst_summed_mdns]
+
+    # calc 95% confidence intervals
     means_p95 = mean_means + serr_means
     means_m95 = mean_means - serr_means
-    terr_means = t_conf_int(t_size, sd_means)
     means_t_p95 = mean_means + terr_means
     means_t_m95 = mean_means - terr_means

-    mean_wtd = statistics.mean(lst_wt_means)
-    sd_wtd = statistics.stdev(lst_wt_means)
-    serr_wtd = 1.96 * sd_wtd / (t_rpt**0.5)
     wtd_p95 = mean_wtd + serr_wtd
     wtd_m95 = mean_wtd - serr_wtd
-    terr_wtd = t_conf_int(t_size, sd_wtd)
     wtd_t_p95 = mean_wtd + terr_wtd
     wtd_t_m95 = mean_wtd - terr_wtd

-    mean_sumd = statistics.mean(lst_summed_mdns)
-    sd_sumd = statistics.stdev(lst_summed_mdns)
-    serr_sumd = 1.96 * sd_sumd / (t_rpt**0.5)
     sumd_p95 = mean_sumd + serr_sumd
     sumd_m95 = mean_sumd - serr_sumd
-    terr_sumd = t_conf_int(t_size, sd_sumd)
     sumd_t_p95 = mean_sumd + terr_sumd
     sumd_t_m95 = mean_sumd - terr_sumd

-    print(f"Random sample of {t_size} countries, repeated {t_rpt} times")
-    # print(f"\n{'Simple mean:':14} {mean_means:.2f}, std dev: {sd_means:.2f}, 95%: {means_m95:.2f} - {means_p95:.2f}")-    # print(f"{'Weighted mean:':14} {mean_wtd:.2f}, std dev: {sd_wtd:.2f}, 95%: {wtd_m95:.2f} - {wtd_p95:.2f}")
-    # print(f"{'Summed median:':14} {mean_sumd:.2f}, std dev: {sd_sumd:.2f}, 95%: {sumd_m95:.2f} - {sumd_p95:.2f}")
-    # print(f"{'World median:':14} {w_median:.2f}")
-
-    print(f"\n{'':14} | {'Avg':>5} | {'std dev'} | {'norm 95% conf':13} | {'t 95% conf':13}")
-    print(f"{'':14} | {'':5} | {'':7} | n={t_rpt:<11} | n={t_size:<11}")
-    print(f"\n{'Simple mean:':14} | {mean_means:.2f} | {sd_means:7.2f} | {means_m95:.2f} - {means_p95:.2f} | {means_t_m95:.2f} - {means_t_p95:.2f}")
-    print(f"{'Weighted mean:':14} | {mean_wtd:.2f} | {sd_wtd:7.2f} | {wtd_m95:.2f} - {wtd_p95:.2f} | {wtd_t_m95:.2f} - {wtd_t_p95:.2f}")
-    print(f"{'Summed median:':14} | {mean_sumd:.2f} | {sd_sumd:7.2f} | {sumd_m95:.2f} - {sumd_p95:.2f} | {sumd_t_m95:.2f} - {sumd_t_p95:.2f}")
-    print(f"{'World median:':14} | {w_median:.2f}")
-
-
     es_tm = time.time()
     diff_tm = es_tm - ed_tm
     print(f"\ngenerating sample stats took {diff_tm:.2f} ({ed_tm:.6f}-{es_tm:.6f})")

-    h_lists = [lst_means, lst_wt_means, lst_summed_mdns]
+    print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")

-    if n_tst == 1:
-      print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")
+    # display appropriate data in terminal window
+    if t_rpt == 1:
+      display_stats_1_sample()
+    else:
+      display_stats_w_rpts()

+    if n_tst == 1:
       fig, axs = plt.subplots(2, 2, sharey=True, gridspec_kw={'hspace': .3}, figsize=(15,8))
       axs[1, 1].axis('off')
       fig.suptitle(f"Estimated World Average Age: {use_yr}")
@@ -324,14 +392,12 @@ if __name__ == '__main__':
           p_row += 1

     elif n_tst == 2:
-      print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")
+      pass

     elif n_tst == 3:
-      print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- NOT CURRENTLY AVAILABLE\n")
+      pass

     elif n_tst == 4:
-      #print(f"\nTest #{n_tst}: {tst_desc[n_tst]} -- under development\n")
-      #print('\n', h_lists)
       z_95s = [(means_m95, means_p95, mean_means), (wtd_m95, wtd_p95, mean_wtd), (sumd_m95, sumd_p95, mean_sumd)]
       t_95s = [(means_t_m95, means_t_p95, mean_means), (wtd_t_m95, wtd_t_p95, mean_wtd), (sumd_t_m95, sumd_t_p95, mean_sumd)]
       for i in range(len(smpl_types)):
@@ -355,9 +421,10 @@ if __name__ == '__main__':
         print(f"{(in_t_95_low + in_t_95_high)/t_rpt*100:.2f}% of samples fall within t 95% confidence interval for {smpl_types[i]}")
         print(f"\t{in_t_95_low} below the mean and {in_t_95_high} above")

-    elif n_tst == max_tst:
-      print(f"\nTest #{n_tst}: {tst_desc[n_tst]}\n")
+    elif n_tst == 5:
+      pass

+    elif n_tst == max_tst:
       fig, ax = plt.subplots(figsize=(10,6))
       fig.suptitle(f"Estimated World Average Age: {use_yr}")
       plot_hist(ax, h_lists[t_stat], use_yr, smpl_types[t_stat], t_size, t_rpt)

And, some output.

# repeated samples case

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py
loading CSV and getting data from population database took 4.43 (1606407853.877998-1606407858.312385)
Random sample of 5 countries, repeated 100 times

               |   Avg | std dev | norm 95% conf | t 95% conf
               |       |         | n=100         | n=5

Simple mean:   | 27.79 |    4.49 | 26.91 - 28.67 | 22.22 - 33.36
Weighted mean: | 28.38 |    7.37 | 26.93 - 29.82 | 19.22 - 37.53
Summed median: | 28.08 |    7.36 | 26.64 - 29.52 | 18.94 - 37.22
World median:  | 28.68

generating sample stats took 0.13 (1606407858.312385-1606407858.443658)
Test #1: Plot Histograms of 3 Types of Sample Data

whole sampling process took 4.81

#
# single sample case

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -r 1 -c 20
loading CSV and getting data from population database took 4.50 (1606408353.237261-1606408357.732792)
Random sample of 20 countries, one time

               |   Avg | std dev | std err | t 95% conf
               |       |         |         | n=20

Simple mean:   | 29.17 |    8.63 |    4.04 | 25.13 - 33.21
Weighted mean: | 26.83 |    6.39 |    2.99 | 23.84 - 29.82
Summed median: | 26.56 |    6.39 |    2.99 | 23.57 - 29.55
World median:  | 28.68

generating sample stats took 0.00 (1606408357.732792-1606408357.732792)

Test #1: Plot Histograms of 3 Types of Sample Data

Traceback (most recent call last):
  File "R:\learn\py_play\population\est_avg_age.py", line 392, in <module>
    plot_hist(axs[p_row, i % 2], h_lists[i], use_yr, smpl_types[i], t_size, t_rpt)
  File "R:\learn\py_play\population\est_avg_age.py", line 152, in plot_hist
    s_mean = statistics.mean(h_data)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\statistics.py", line 315, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

So, now we move onto sorting the plotting issue.

Modify Histogram Plotting Function

I again thought about modifying the existing plot function, plot_hist(). And, once again thought otherwise. I will write a new function that works for our single sample case. If I like it and it is flexible enough, I may change the existing code to use the new function. Because I don’t plan to do any of the statistics calculations in the new function I’ve decided to call it plot_hist_basic(). Any and all values to be plotted will be passed to the function. All it will do is plot the basic histogram and add lines for any additional values passed in. Not quite sure how to handle the latter, but I am sure we can sort something out — hopefully something fairly flexible.

To start with I’ll just plot the histogram, a title and some axis labels.

# initial version of the new function

def plot_hist_basic(ax, h_data):
  x_def = 'Age'
  y_def = 'Count'
  ttl_def = f"Simulation: {len(h_data)} countries, no repeats"

  ax.set_xlabel(x_def)
  ax.set_ylabel(y_def)
  ax.set_title(ttl_def)

  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

# code in new/temporary test block (will definitely change)

    elif n_tst == 5:
      fig, ax = plt.subplots(figsize=(10,6))
      fig.suptitle(f"Estimated World Average Age for {use_yr} ({t_size} countries, one time)")

      plot_hist_basic(ax, list(s_medians.values()))

      plt.show()

And, the output, terminal and plot, looks like the following.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 5 -r 1 -c 20
loading CSV and getting data from population database took 4.48 (1606608222.524516-1606608227.002165)

generating sample stats took 0.01 (1606608227.002165-1606608227.009165)

Test #5: Generate and plot a single sample of the specified size

Random sample of 20 countries, one time

               |   Avg | std dev | std err | t 95% conf
               |       |         |         | n=20

Simple mean:   | 28.92 |    8.86 |    4.15 | 24.77 - 33.06
Weighted mean: | 31.52 |   10.97 |    5.13 | 26.39 - 36.65
Summed median: | 32.79 |   10.97 |    5.13 | 27.65 - 37.92
World median:  | 28.68

whole sampling process took 4.65
image showing base histogram generated by the new function as it currently stands

And, there we go — another bug/issue/problem — “Simulation: 18 countries, no repeats‽‽ The code I currently have in the sampling function does not ensure the same country is not selected more than once for any given execution. I really don’t want this single sample to be done with replacement. Something we will need to deal with before moving on too much further. But for now, I will focus on adding the plotting of the mean and the confidence intervals and allowing the title and axis labels to be changed. I have an idea how to go about this. But as usual, not sure it is the best solution. But for now…

**kwargs

My plan is to set some defaults for certain plot items in the function. Then use **kwargs to allow those values to be overridden by keyworded arguments. We can also include additional keyworded arguments to set other variables to be used for plotting lines on the chart. In the code that calls the function, I will create a dictionary with the items I wish overridden or added, then pass it to the function using the splat operator on the dictionary. Could likely just pass/accept the dictionary as an optional argument. But this seemed like more fun.

I assume you have written your version. Now you can criticize my attempt. Oh yes, I also changed the line-style for the confidence interval as it wasn’t showing well in the charts I saw during testing. Function is a bit long, but…

Also do note the use of the get() function to cover the cases where a given argument is not in the dictionary provided by **kwargs.

# new function

def plot_hist_basic(ax, h_data, **kwargs):
  # Defaults
  x_def = 'Age'
  y_def = 'Count'
  ttl_def = f"Simulation: {len(h_data)} countries, no repeats"
  mean_lbl_def = "Est. Average Age"
  do_lgnd = False

  # Optional keyworded arguments and derived values
  x_lbl = kwargs.get('x_lbl', x_def)
  y_lbl = kwargs.get('y_lbl', y_def)
  p_ttl = kwargs.get('title', ttl_def)
  s_mean = kwargs.get('mean', None)
  mean_lgnd = kwargs.get('mean_lbl', mean_lbl_def)
  s_stdev = kwargs.get('stdev', None)
  s_std_err = kwargs.get('std_err', None)
  m95 = None
  p95 = None
  if s_mean and s_std_err:
    m95 = s_mean - s_std_err
    p95 = s_mean + s_std_err

  # now onto plotting
  ax.set_xlabel(x_lbl)
  ax.set_ylabel(y_lbl)
  ax.set_title(p_ttl)

  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

  if s_mean:
    ax.axvline(s_mean, 0, 1, color='r', label=f'{mean_lgnd}: {s_mean:.2f}')
  if m95:
    ax.axvline(m95, 0, 1, color='hotpink', ls='-.', label=f'min 95% conf interval: {m95:.2f}')
  if p95:
    ax.axvline(p95, 0, 1, color='orange', ls='-.', label=f'max 95% conf interval: {p95:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  if s_stdev:
    ax.axvline(s_stdev, 0, 0, color='w', label=f'std dev: {s_stdev:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  if s_std_err:
    ax.axvline(s_std_err, 0, 0, color='w', label=f'std err: {s_std_err:.2f}')
  # ax.axvline(w_median, 0, 0, color='k', label=f'World Average: {w_median:.2f}')
  # display legend if appropriate
  if s_mean or s_stdev or s_std_err:
    ax.legend()

# and in the 'if n_tst == 5:' block

    elif n_tst == 5:
      fig, ax = plt.subplots(figsize=(10,6))
      fig.suptitle(f"Estimated World Average Age for {use_yr} ({t_size} countries, one time)")

      kargs = {
        'title': '',
        'x_lbl': f'Average Age (Median Age)',
        'mean': mean_means,
        'm_lbl': "Est. Average Age (weighted mean)",
        'stdev': sd_means,
        'std_err': terr_means
        }
      plot_hist_basic(ax, list(s_medians.values()), **kargs)

Which results in something like:

(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 5 -c 20 -r 1 -s 2
loading CSV and getting data from population database took 4.41 (1606777137.473756-1606777141.879965)

generating sample stats took 0.01 (1606777141.879965-1606777141.886964)

Test #5: Generate and plot a single sample of the specified size

Random sample of 20 countries, one time

               |   Avg | std dev | std err | t 95% conf
               |       |         |         | n=20

Simple mean:   | 29.47 |    8.84 |    4.14 | 25.33 - 33.61
Weighted mean: | 26.61 |    7.80 |    3.65 | 22.96 - 30.26
Summed median: | 25.74 |    7.80 |    3.65 | 22.09 - 29.39
World median:  | 28.68
image showing histogram with additional lines for a single sample of 20 countries

No Replacement

And back to our little bug. In my earlier development/test code, I was keeping a separate list of selected countries and skipping any that were already in the list. I also kept a count of successfully sampled countries, stopping the loop when I had the desired number. I didn’t include that code in either of the ‘get a sample’ functions in this module. A mistake I didn’t realize I was making until just recently.

As you may recall, the df.sample() function allows us to specify the number of items to randomly select when called. And, it turns out, it also has a keyword parameter, replace: bool, default False, that determines whether the selection should be done with or without replacement. The default as you can see suits us perfectly.

So I am going to use sample() to get the desired number of countries then loop over the returned dataframe to generate the desired sample data. I am modifying both functions, get_sample() and get_single_sample(). Will only include the changed code for the latter.

def get_single_sample(r_nbr):
  global df_country, c_data
  c_medians = {}
  c_totals = {}
  wt_mean = 0
  sum_median = 0
  c_pops = []
  sum_pop = [0 for _ in range(21)]
  
  rand_c = df_country.sample(r_nbr)
  for _, row in rand_c.iterrows():
    c_nm = row['valid']
    s_data = c_data[c_nm]
    c_pops.append(s_data)
    c_totals[c_nm] = sum(s_data)
    c_median = dstat.get_wt_quartile(_L_LOWS, s_data)
    c_medians[c_nm] = c_median

  wt_medians = [c_medians[nm] * c_totals[nm] for nm in c_medians.keys()]
  wt_mean = sum(wt_medians) / sum(c_totals.values())
  sum_pop = [sum(n) for n in zip(*c_pops)]
  sum_median = dstat.get_wt_quartile(_L_LOWS, sum_pop)

  return c_medians, c_totals, wt_mean, sum_median

Temporary Experiment

Finally, for this post, I am going to temporarily modify “experiment/test 5” to display the various stats for a single sample of the specified size on a 2x2 figure. The first plot will show the sampled medians and the mean, 95% confience intervals for the “simple mean”. The second for the “weighted mean” and the third for the “median of the summed country populations”. I am using the standard deviation and the standard error for the weighted mean for the “summed median” as with a single data point there is no way to calculate the standard deviation.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 5 -c 20 -r 1
loading CSV and getting data from population database took 4.41 (1606780833.761246-1606780838.169694)

generating sample stats took 0.01 (1606780838.169694-1606780838.174694)

Test #5: Generate and plot a single sample of the specified size

Random sample of 20 countries, one time

               |   Avg | std dev | std err | t 95% conf
               |       |         |         | n=20

Simple mean:   | 29.65 |    9.27 |    4.34 | 25.31 - 33.98
Weighted mean: | 30.60 |    8.65 |    4.05 | 26.55 - 34.65
Summed median: | 30.19 |    8.65 |    4.05 | 26.14 - 34.23
World median:  | 28.68

whole sampling process took 4.67
image showing 2x2 figure of 3 histograms for a single sample of 20 countries

And the code:

    elif n_tst == 5:
      fig, axs = plt.subplots(2, 2, figsize=(15,9), sharey=True)
      fig.suptitle(f"Estimated World Average Age for {use_yr} ({t_size} countries, one time)")

      kargs = {
        'title': '',
        'x_lbl': f'Average Age (Median Age)',
        'mean': mean_means,
        'm_lbl': "Est. Average Age (simple mean)",
        'stdev': sd_means,
        'std_err': terr_means
        }
      plot_hist_basic(axs[0, 0], list(s_medians.values()), **kargs)

      kargs = {
        'title': '',
        'x_lbl': f'Average Age (Median Age)',
        'mean': mean_wtd,
        'm_lbl': "Est. Average Age (weighted mean)",
        'stdev': sd_wtd,
        'std_err': terr_wtd
        }
      plot_hist_basic(axs[0, 1], list(s_medians.values()), **kargs)

      kargs = {
        'title': '',
        'x_lbl': f'Average Age (Median Age)',
        'mean': mean_sumd,
        'm_lbl': "Est. Average Age (median of sum)",
        'stdev': sd_sumd,
        'std_err': terr_sumd
        }
      plot_hist_basic(axs[1, 0], [mean_sumd], **kargs)

Once Again a Little Short on Progress

My intention was to plot the results of differing sample sizes on that 2x2 grid. But, this post was getting a might long, and I was getting a little lazy. So, took the easy way out and am now calling it quits for this post. The next one will look at the impact of differing sample sizes for the single sample case.

I was planning to publish the next post on the coming Thursday, so as to keep things moving. But, I got into the Advent of Code and have wasted too much time everyday this month trying to solve the puzzles. No real progress on the blog. So, the next post will be on the usual weekly schedule — i.e. Monday a week today. And, I may toss in a post or three covering some of the Advent of Code puzzle solutions — at the expense of the posts related to the current subject.

Resources