Okay, as mentioned last time, I am going to continue playing around with our single sample experiments. The plan is to sample and plot the results for four different sample sizes: something like 20, 30, 40 and 50 countries in a single sample. I will also likely include a table showing the statistics for each sample to make comparison easier. And I will only plot one statistic for each sample.

I am also thinking about generating a number of samples of a given size and seeing how many of them fall within the 95% confidence interval. Though I don’t know which confidence interval to use. Perhaps just that of the first sample. Something to think about, and perhaps research. I wish I really knew something about statistics.
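For what it’s worth, that coverage idea can be prototyped without any country data at all: draw many samples from a known population, build a t-based 95% interval for each, and count how often the interval contains the true population mean. A throwaway sketch, using a synthetic normal population and a hard-coded t value, nothing from this project:

```python
# A throwaway coverage check: sample repeatedly from a synthetic normal
# population (NOT the country data) and count how often a t-based 95%
# interval contains the true population mean.
import random
import statistics

random.seed(42)
population = [random.gauss(30, 9) for _ in range(10_000)]
true_mean = statistics.mean(population)

T_95_DF_29 = 2.045  # two-tailed t critical value for df = n - 1 = 29

n = 30
runs = 1_000
covered = 0
for _ in range(runs):
  sample = random.sample(population, n)
  s_mean = statistics.mean(sample)
  half = T_95_DF_29 * statistics.stdev(sample) / n ** 0.5
  if s_mean - half <= true_mean <= s_mean + half:
    covered += 1

print(f"{covered / runs:.1%} of the intervals covered the true mean")
```

In theory about 95% of the intervals should cover the true mean; how that plays out with real country samples is the part I still need to research.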

Four Sample Sizes Compared

I was going to rework experiment #5 (if n_tst == 5), but I think I will just add another experiment block (if n_tst == 6). I will be using t distribution values for the confidence interval calculations. This involves code we’ve seen numerous times already. So go get it coded and come back when you’re done. Then you can critique my naive approach.
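The t_conf_int helper doing that t distribution work isn’t reproduced in this post. Judging from the output tables it returns the interval half-width, t * (sd / sqrt(n)). A minimal sketch of such a helper, assuming SciPy is available; the signature and default are my assumptions, and the real code may differ:

```python
# My guess at a t_conf_int(n, sd) helper; the post does not show its body,
# so treat the signature and behavior as assumptions.
from scipy import stats

def t_conf_int(n, sd, conf=0.95):
  # two-tailed t critical value for n - 1 degrees of freedom
  t_crit = stats.t.ppf((1 + conf) / 2, df=n - 1)
  # half-width of the confidence interval: t * (sd / sqrt(n))
  return t_crit * sd / n ** 0.5

print(f"{t_conf_int(20, 10.13):.2f}")  # 4.74
```

Fed the 20 country simple mean row from the first table below (sd = 10.13), this reproduces the tabled half-width of 4.74.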

But first let’s look at a couple of sets of results.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -c 20 -r 1 -e 6 -s 1
loading CSV and getting data from population database took 4.36 (1606955485.620786-1606955489.977785)

generating sample stats took 0.00 (1606955489.977785-1606955489.981785)

Test #6: Generate and compare samples of varying sizes

Random sample of varying numbers of countries, simple mean, one time

Sample Size    |   Avg | std dev | std err | t 95% conf
               |       |         |         |
            20 | 28.48 |   10.13 |    4.74 | 23.73 - 33.22
            30 | 28.11 |    8.59 |    3.21 | 24.90 - 31.32
            40 | 29.29 |    9.38 |    3.00 | 26.29 - 32.29
            50 | 27.98 |    8.28 |    2.35 | 25.63 - 30.34
World median:  | 28.68

whole sampling process took 4.67
[Image: 2x2 figure of 4 histograms for the simple mean of single samples of varying sizes]
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -c 20 -r 1 -e 6 -s 2
loading CSV and getting data from population database took 4.44 (1606955582.224079-1606955586.660079)

generating sample stats took 0.00 (1606955586.660079-1606955586.664079)

Test #6: Generate and compare samples of varying sizes

Random sample of varying numbers of countries, weighted mean, one time

Sample Size    |   Avg | std dev | std err | t 95% conf
               |       |         |         |
            20 | 30.43 |   11.21 |    5.25 | 25.18 - 35.67
            30 | 27.23 |    8.64 |    3.23 | 24.01 - 30.46
            40 | 30.92 |    7.87 |    2.52 | 28.40 - 33.43
            50 | 26.26 |    9.61 |    2.73 | 23.53 - 28.99
World median:  | 28.68

whole sampling process took 4.75
[Image: 2x2 figure of 4 histograms for the weighted mean of single samples of varying sizes]

What Do We Know

I am not too sure what knowledge can be drawn from the above sample runs.

That said, none of the samples look particularly normally distributed. The simple mean seemed to provide a better estimate than the weighted mean for similar sample sizes. And, as expected, the standard error generally went down as the number of countries in the sample went up, since the standard deviation gets divided by the square root of the sample size. The one exception was the 50 count sample using the weighted mean to estimate the average world age.

But in that case the standard deviation took a fairly large jump over the 40 count sample. For this set of samples, the 95% confidence interval did cover the actual world average age for every sample size and both estimators. Strangely, to me, the 50 count sample in both cases seemed to provide a poorer estimate than the smaller samples.
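That shrinking standard error is just the square root of the sample size at work. A toy illustration, holding the standard deviation fixed at an assumed, made-up 9.0 and leaving the t multiplier out:

```python
# Toy illustration: with the standard deviation held fixed (9.0 is an
# assumed, made-up value), the standard error sd / sqrt(n) shrinks as n grows.
import math

sd = 9.0
for n in [20, 30, 40, 50]:
  print(n, round(sd / math.sqrt(n), 2))
# 20 2.01, 30 1.64, 40 1.42, 50 1.27
```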

Code

Oh yes, my code for this experiment (#6 in my case). For the sample sizes I took whatever the user supplied, or the default if not supplied, as the first sample size, then incremented by 10 for each of the 3 remaining samples. Seemed a simple enough approach, with a touch of flexibility. I also created a new function, all_stat_sng_sample(c_nbr), to get the numbers I needed for each sample size; the existing code just wasn’t going to provide them. But I still only read the data from file once.

# the new function (statistics and numpy as np are already imported at the top of the module)

def all_stat_sng_sample(r_nbr):
  # one sample of r_nbr countries, plus the weighted and summed mean estimates
  s_medians, s_totals, mean_wtd, mean_sumd = get_single_sample(r_nbr)

  # simple mean: mean, standard deviation and t-based 95% interval half-width
  mean_means = statistics.mean(s_medians.values())
  sd_means = statistics.stdev(s_medians.values())
  terr_means = t_conf_int(r_nbr, sd_means)

  # weighted mean: population-weighted standard deviation via np.cov
  sd_wtd = np.sqrt(np.cov(list(s_medians.values()), aweights=list(s_totals.values())))
  terr_wtd = t_conf_int(r_nbr, sd_wtd)

  # median of sums: reuse the weighted standard deviation
  sd_sumd = sd_wtd
  terr_sumd = t_conf_int(r_nbr, sd_sumd)

  # package by estimator: [simple mean, weighted mean, median of sums]
  p_means = [mean_means, mean_wtd, mean_sumd]
  p_sds = [sd_means, sd_wtd, sd_sumd]
  # despite the name, these are the t 95% half-widths, not raw standard errors
  p_ses = [terr_means, terr_wtd, terr_sumd]

  return s_medians, s_totals, p_means, p_sds, p_ses
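I wasn’t entirely sure np.cov with aweights gives what I think of as a weighted standard deviation, so here is a quick sanity check (my own, with made-up numbers) against a hand-rolled version that mirrors np.cov’s documented normalization:

```python
# Sanity check (mine, with made-up numbers) that np.cov with aweights
# matches a hand-rolled weighted standard deviation using np.cov's
# documented normalization: divide by V1 - V2 / V1 for the default ddof = 1.
import numpy as np

ages = np.array([28.0, 35.0, 22.0, 41.0])
pops = np.array([5.0, 50.0, 12.0, 30.0])  # hypothetical population weights

sd_np = np.sqrt(np.cov(ages, aweights=pops))

w = pops / pops.sum()
mean_w = np.sum(w * ages)
v1 = w.sum()
v2 = np.sum(w ** 2)
var_manual = np.sum(w * (ages - mean_w) ** 2) / (v1 - v2 / v1)
sd_manual = np.sqrt(var_manual)

print(np.isclose(sd_np, sd_manual))  # should print True
```

The aweights normalization is scale invariant, so it doesn’t matter whether the raw country populations or proportions are passed in.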

...

# I also suppressed printing to the terminal of data not related to experiment #6
    if n_tst != 6:
      if t_rpt == 1:
        display_stats_1_sample()
      else:
        display_stats_w_rpts()

...

    elif n_tst == 6:
      fig, axs = plt.subplots(2, 2, figsize=(15,12), sharey=True)
      fig.suptitle(f"Estimated World Average Age for {use_yr} (various sample sizes, one time)")
      ttl_stat = ['simple mean', 'weighted mean', 'median of sum']

      print(f"Random sample of varying numbers of countries, {ttl_stat[t_stat]}, one time")

      print(f"\n{'Sample Size':14} | {'Avg':>5} | {'std dev'} | {'std err':7} | {'t 95% conf':13}")
      print(f"{'':14} | {'':5} | {'':7} | {'':7} | {'':<11}")

      r_nbr = 0
      for i in range(4):
        # move to the next row of the 2x2 grid after every second plot
        if i > 0 and i % 2 == 0:
          r_nbr += 1
        # sample size for this plot: the base size plus 10 per iteration
        c_nbr = t_size + (i * 10)

        p_medians, p_tots, p_mean, p_sd, p_se = all_stat_sng_sample(c_nbr)

        # p_se holds the t 95% half-width, so mean +/- half-width is the interval
        m95 = p_mean[t_stat] - p_se[t_stat]
        p95 = p_mean[t_stat] + p_se[t_stat]
        int_95 = f"{m95:.2f} - {p95:.2f}"

        print(f"{c_nbr:14} | {p_mean[t_stat]:5.2f} | {p_sd[t_stat]:7.2f} | {p_se[t_stat]:7.2f} | {int_95}")

        kargs = {
          'title': f'Sample of {len(p_medians)} countries',
          'x_lbl': f'Average Age (Median Age)',
          'mean': p_mean[t_stat],
          'm_lbl': f"Est. Average Age ({ttl_stat[t_stat]})",
          'stdev': p_sd[t_stat],
          'std_err': p_se[t_stat]
          }

        if t_stat in (0, 1):
          # simple and weighted mean: histogram of the sample's median ages
          plot_hist_basic(axs[r_nbr, i % 2], list(p_medians.values()), **kargs)
        elif t_stat == 2:
          # median of sums: only the single estimate to plot
          plot_hist_basic(axs[r_nbr, i % 2], [p_mean[t_stat]], **kargs)

      print(f"{'World median:':14} | {w_median:.2f}")
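As an aside, the manual r_nbr bookkeeping could be dropped by iterating over the axes grid directly with axs.flat. A standalone sketch, not wired into the real plotting code:

```python
# Standalone sketch (not wired into the real code): iterate the 2x2 axes
# grid with axs.flat instead of tracking the row number by hand.
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2)
for i, ax in enumerate(axs.flat):
  # i // 2 and i % 2 would be the hand-computed row and column
  ax.set_title(f"Sample {i}")

print([ax.get_title() for ax in axs.flat])
# ['Sample 0', 'Sample 1', 'Sample 2', 'Sample 3']
```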

Hope you aren’t laughing too hard.

Ah! Another Delay

I had plans to continue this post with a section comparing the output of a number of runs of a given sample size, and seeing how that 95% confidence interval might look and work. Including plotting them, of course. This might take a bit of messing about on my part, perhaps involving some further research.

As I got carried away and spent too much time each day during December trying to solve the daily puzzles in this year’s Advent of Code, I got rather behind on this project (the blogging and the coding). So, I think I am just going to call it quits for this post. And, with a little luck, I will hopefully get back into things given the holiday season is pretty much done.

So, why don’t you give it a go, and see you back here next week.

Resources