Confidence Interval

I don’t recall whether I mentioned that I have switched to the t-statistic for calculating confidence intervals. Because the population distribution for any given country and year is not normal, the Z value is likely not appropriate. In addition, the Z-statistic requires that one know the population standard deviation, which is not the case for this exercise.

So I wrote a function, t_conf_int(), to give me the confidence interval half-width (margin of error) based on the t-distribution for the size of my sample (degrees of freedom). I am using the scipy package to get the appropriate t-value. Then, I altered all the appropriate functions to use the new function when calculating confidence intervals. That included, amongst others, all_stat_sng_sample() and display_stats_1_sample().

Note: these changes were already in the code I posted along with the last post, Module Code: est_avg_age.py for Experiment 6 & 7. Some of the code I wrote previously when I started working on the concept for this post is also in that last bit of posted code.

# if not already imported, then you will need to import
from scipy import stats

...

def t_conf_int(s_size, s_sdev, p_interval=95, tails=2):
  # default: 95% confidence interval for a 2-tailed sample
  # returns the half-width (margin of error) of the interval
  t = stats.t.ppf(1 - ((100 - p_interval) / tails / 100), s_size - 1)
  d_int = t * s_sdev / (s_size**0.5)
  return d_int
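As a quick sanity check, here's a hypothetical comparison (the sample size and standard deviation are made up, not from the actual country data) of the t-based half-width against the z-based one. For small samples the t-interval should always come out a bit wider:

```python
from scipy import stats

def t_conf_int(s_size, s_sdev, p_interval=95, tails=2):
  # default: 95% confidence interval for a 2-tailed sample
  t = stats.t.ppf(1 - ((100 - p_interval) / tails / 100), s_size - 1)
  return t * s_sdev / (s_size**0.5)

# hypothetical sample: 20 countries, standard deviation of 3.8 years
t_half = t_conf_int(20, 3.8)

# same half-width using the normal (z) approximation
z = stats.norm.ppf(0.975)
z_half = z * 3.8 / (20**0.5)

print(f"t-based half-width: {t_half:.2f}")
print(f"z-based half-width: {z_half:.2f}")
```

The t half-width is the larger of the two, which is exactly why the z-interval understates the uncertainty for small samples.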

Multiple Samples of a Given Size

Basically I am going to carry on from where I left off last post. Recall that we plotted the individual sample means for 10 runs of 22 samples of 33 countries each, i.e. 220 sample means in total. We then calculated the mean of those 220 samples to compare against the actual population average age (the median in our case). I realize I am mixing up our statistics of interest, but it's good enough for government work.

What I would like to look at is how things change if I stick to a given sample size (that is, the number of countries per sample) but change the total number of samples (repetitions). I don't yet know exactly which numbers I will use. But I will try at least 2, perhaps 3, different sample sizes, maybe more, and likely at least 4 different values for the number of times the sample is repeated.

I don’t expect I will show the data generated, just the plots. But, you never know.

For this experiment I am deleting the unused code in my experiment 8 block and adding the new code there.

Basic Start

I am starting by adding the code to get X samples of size N. Then a bit of code to plot the result along with a few extra bits of info. I am also counting how many samples did not have the actual population average (for the given year) in their 95% confidence interval. A lot of this initial code was copied from experiment 7, then edited appropriately. I also made a few changes to the plotting function, plot_hist_basic(). Mostly adding a new kwargs item and some additional code to display the population average age and the count of 95% confidence interval failures just mentioned.

Here are the changes and additions, with a sample of the chart generated/displayed using the default values.

def plot_hist_basic(ax, h_data, **kwargs):
  # Defaults
  x_def = 'Age'
  y_def = 'Count'
  ttl_def = f"Simulation: {len(h_data)} countries, no repeats"
  mean_lbl_def = "Est. Average Age"
  do_lgnd = False

  # Optional keyworded arguments and derived values
  x_lbl = kwargs.get('x_lbl', x_def)
  y_lbl = kwargs.get('y_lbl', y_def)
  p_ttl = kwargs.get('title', ttl_def)
  s_mean = kwargs.get('mean', None)
  mean_lgnd = kwargs.get('m_lbl', mean_lbl_def)
  s_stdev = kwargs.get('stdev', None)
  s_std_err = kwargs.get('std_err', None)
  m95 = None
  p95 = None
  if s_mean and s_std_err:
    m95 = s_mean - s_std_err
    p95 = s_mean + s_std_err
  pop_mean = kwargs.get('pop_mean', None)
  missed_95 = kwargs.get('ms95', None)

  # now onto plotting
  ax.set_xlabel(x_lbl)
  ax.set_ylabel(y_lbl)
  ax.set_title(p_ttl)

  n, bins, patches = ax.hist(h_data, bins=range(0, 60))

  if s_mean:
    ax.axvline(s_mean, 0, 1, color='r', label=f'{mean_lgnd}: {s_mean:.2f}')
  if m95:
    ax.axvline(m95, 0, 1, color='hotpink', ls='-.', label=f'min 95% conf interval: {m95:.2f}')
  if p95:
    ax.axvline(p95, 0, 1, color='orange', ls='-.', label=f'max 95% conf interval: {p95:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  if s_stdev:
    ax.axvline(s_stdev, 0, 0, color='w', label=f'std dev: {s_stdev:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  if s_std_err:
    ax.axvline(s_std_err, 0, 0, color='w', label=f'std err: {s_std_err:.2f}')
  if pop_mean:
    ax.axvline(pop_mean, 0, 1, color='aqua', label=f'Population average: {pop_mean:.2f}')
  # not actually plotting anything as such, just want value to show in legend area
  if missed_95:
    ax.axvline(missed_95, 0, 0, color='w', label=f'Population average not in 95% CI: {missed_95}')
  # display legend if appropriate
  if s_mean or s_stdev or s_std_err or pop_mean:
    ax.legend()

...

    elif n_tst == 8:
      world_avg = w_median
      t8_data_file = f't8_samples_{ex_f_nbr}.txt'
      t8_rpt = 30
      t8_size = 10
      t8_stat = 0
      t8_means = []
      t8_stderrs = []

      t8_seed = 349

      if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
        t8_stat = args.stat - 1
      if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
        t8_rpt = args.rpts
      if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
        t8_size = args.c_cnt

      tmp_seed = t8_seed
      for i in range(t8_rpt):
        tmp_seed += (i * 409)
        p_medians, p_tots, p_mean, p_sd, p_se = all_stat_sng_sample(t8_size, df_seed=tmp_seed)
        t8_means.append(p_mean[t8_stat])
        t8_stderrs.append(p_se[t8_stat])

      missed_interval = 0
      for i, mean in enumerate(t8_means):
        if world_avg < mean - t8_stderrs[i] or world_avg > mean + t8_stderrs[i]:
          missed_interval += 1
      
      with open(P_PATH / t8_data_file, 'a') as fl_dbug:
        fl_dbug.write("{\n")
        fl_dbug.write(f"  'parameters': [{t8_size}, {t8_rpt}],\n")
        fl_dbug.write(f"  'means': {t8_means},\n")
        fl_dbug.write(f"  'stderrs': {t8_stderrs},\n")
        fl_dbug.write(f"  'missed': {missed_interval}\n")
        fl_dbug.write("},\n")
      
      # I am likely going to plot the charts for the 4 differing numbers of repeats on a single figure
      #fig, axs = plt.subplots(2, 2, figsize=(15,9), sharey=True)
      f_wd = 10
      f_ht = 6
      fig, ax = plt.subplots(figsize=(f_wd,f_ht))
  
      fig.suptitle(f"Estimated World Average Age for {t8_rpt} Samples of Size {t8_size}")
      smpl_mean = sum(t8_means) / t8_rpt
      rnd_mean = f"{smpl_mean:.2f}"
      plt_mean = float(rnd_mean)

      kargs = {
        'title': '',
        'mean': plt_mean,
        'm_lbl': f'Mean of all {t8_rpt} samples',
        'pop_mean': world_avg,
        'ms95': missed_interval
        }
      plot_hist_basic(ax, t8_means, **kargs)
*Histogram showing means for 30 samples of size 10*
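As an aside, rather than hand-writing that dict-like text to the debug file, the json module could produce something machine-readable with less fuss. The file name and the values below are made up purely for illustration:

```python
import json

# same shape of record the experiment 8 block writes, as a plain dict
record = {
  "parameters": [10, 30],       # [sample size, repetitions]
  "means": [28.4, 28.9, 27.7],  # truncated illustration
  "stderrs": [0.9, 0.8, 1.1],
  "missed": 1,
}

# one JSON object per line ("JSON lines"), appended like the debug file
with open("t8_samples_demo.json", "a") as fl:
  json.dump(record, fl)
  fl.write("\n")

# reading back: one json.loads() per line
with open("t8_samples_demo.json") as fl:
  loaded = [json.loads(line) for line in fl]
print(loaded[-1]["missed"])
```

The appeal of the one-object-per-line layout is that the file can be appended to across runs yet still parsed without any custom code.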

And, here’s a slightly bigger sample repeated more often.

(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 8 -c 20 -r 50
R:\learn\py_play
loading CSV and getting data from population database took 9.94 (1610901935.680137-1610901945.617773)

generating sample stats took 0.27 (1610901945.617773-1610901945.885316)

Test #8: Compare repeating a sample a differing number of times

Random sample of 20 countries, repeated 50 times

               |   Avg | std dev | norm 95% conf | t 95% conf
               |       |         | n=50          | n=20

Simple mean:   | 28.16 |    2.00 | 27.61 - 28.72 | 27.23 - 29.10
Weighted mean: | 28.76 |    3.84 | 27.70 - 29.83 | 26.97 - 30.56
Summed median: | 28.09 |    3.88 | 27.01 - 29.16 | 26.27 - 29.90
World median:  | 28.68

whole sampling process took 10.84
*Histogram showing means for 50 samples of size 20*

The results certainly look to be a fair bit more normally distributed than some of our earlier distributions.
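If eyeballing the histogram isn't convincing, a normality test can back this up. The sketch below uses a made-up skewed population (a gamma distribution standing in for the country data) and scipy's normaltest; all the parameters here are invented for illustration, not taken from the actual dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(349)
# stand-in population: right-skewed "ages", roughly centred near 28
population = rng.gamma(shape=4.0, scale=7.0, size=10_000)

# means of 50 samples of size 20, mirroring the run above
means = [rng.choice(population, size=20).mean() for _ in range(50)]

# D'Agostino-Pearson test: a large p-value means the sample means
# show no strong evidence of departing from normality
stat, p = stats.normaltest(means)
print(f"normaltest p-value: {p:.3f}")
```

Even with only 20 draws per sample from a clearly skewed population, the distribution of the sample means tends to look much more normal, which is the central limit theorem doing its job.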

That’s It For Today

Sorry! As I have mentioned, life and Advent of Code have seriously mangled my productivity with respect to the blog and this particular exercise in coding and statistics. That was true for the whole of December and so far this new year. So, unless I get really productive later today (2021.01.17), the above is likely it for this post.

In the next one, I plan to look at plotting the results for samples of a single size with 4 different repetition counts. I will likely plot them all on one figure so that it is easier to visually compare them. I plan to continue using a seed for the random sampling function so that the data for each of the four repeats always uses the same data at the front. I figure that will give us a more realistic idea of what increasing the number of repetitions actually accomplishes — other than reducing the sample's standard error and perhaps the standard deviation.

After that, my biggest concern is where I go next. I think I may go back and start over with Data Science Bookcamp. I just don’t know how I can blog about it without risking copyright infringement. We shall see. I may also look at doing my coding for the book in a Jupyter Notebook using VS Code. A new challenge all of its own.

Until next time.

Resources