Confidence Interval
I don’t recall whether or not I mentioned that I have switched to the t-statistic to calculate the confidence intervals. Because the population distribution for any given country and year is not normal, I gather the Z-value is likely not appropriate. In addition, the Z-statistic apparently requires that one know the population standard deviation, which, for this exercise, we do not.
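Just to get a feel for how much difference that makes, here is a quick comparison (not part of the module code) of the two-tailed 95% critical values for a few sample sizes. The t-value is noticeably larger for small samples and converges to the Z-value as the sample size grows.

from scipy import stats

# two-tailed 95% => 97.5th percentile
z_95 = stats.norm.ppf(0.975)
for df in (9, 32, 99, 999):
    t_95 = stats.t.ppf(0.975, df)
    print(f"df = {df:>3}: t = {t_95:.3f} vs z = {z_95:.3f}")

# df =   9: t = 2.262 vs z = 1.960
# df =  32: t = 2.037 vs z = 1.960
# df =  99: t = 1.984 vs z = 1.960
# df = 999: t = 1.962 vs z = 1.960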
So I wrote a function, t_conf_int(), to give me the half-width of the confidence interval based on the t-distribution for the size of my sample (degrees of freedom), i.e. the t-value times the standard error. I am using the scipy package to get the appropriate t-value. Then I altered all the appropriate functions to use the new function when calculating their confidence intervals with the t-statistic. That included, amongst others, all_stat_sng_sample() and display_stats_1_sample().
Note: these changes were already in the code I posted along with the last post, Module Code: est_avg_age.py for Experiment 6 & 7. Some of the code I wrote previously when I started working on the concept for this post is also in that last bit of posted code.
# if not already imported, then you will need to import
from scipy import stats
...
def t_conf_int(s_size, s_sdev, p_interval=95, tails=2):
    # default: 95% confidence interval for a 2-tailed sample
    # returns the half-width of the interval: t-value * standard error
    t = stats.t.ppf(1 - ((100 - p_interval) / tails / 100), s_size - 1)
    d_int = t * s_sdev / (s_size**0.5)
    return d_int
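For example (a purely made-up sample, just to show how the function gets used):

import statistics

# hypothetical sample of estimated average ages, for illustration only
smpl = [22.4, 31.7, 28.9, 19.6, 35.2, 27.8, 30.1, 24.5, 29.3, 33.0]
s_mean = statistics.mean(smpl)
s_sdev = statistics.stdev(smpl)   # sample standard deviation (n - 1)
ci_half = t_conf_int(len(smpl), s_sdev)
print(f"mean: {s_mean:.2f}, 95% conf interval: {s_mean - ci_half:.2f} - {s_mean + ci_half:.2f}")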
Multiple Samples of a Given Size
Basically I am going to carry on from where I left off last post. Recall we plotted the individual samples for 10 runs of 22 samples of 33 countries each. That is, we plotted the sample means for the 220 samples, and calculated the mean of those 220 samples to compare to the actual population average age (the median in our case). I realize I am mixing up our statistic of interest, but it's good enough for government work.
What I would like to look at is how things change if I stick to a given sample size (that is, the number of countries per sample) but change the total number of samples (repetitions). I don’t yet know what numbers I will be using for each, but I will use at least 2, perhaps 3, different sample sizes. Maybe more. I will likely also use at least 4 different values for the number of times the sample is repeated.
I don’t expect I will show the data generated, just the plots. But, you never know.
For this experiment I am deleting the unused code in my experiment 8 block and adding the new code there.
Basic Start
I am starting by adding the code to get X samples of size N, then a bit of code to plot the result along with a few extra bits of info. I am also counting how many samples did not have the actual population average (for the given year) in their 95% confidence interval. A lot of this initial code was copied from experiment 7 and then edited appropriately. I also made a few changes to the plotting function, plot_hist_basic(): mostly adding a new kwargs item and some additional code to display the population average age and the count of 95% confidence interval failures just mentioned.
Here are the changes and additions, with a sample of the chart generated/displayed using the default values.
def plot_hist_basic(ax, h_data, **kwargs):
    # Defaults
    x_def = 'Age'
    y_def = 'Count'
    ttl_def = f"Simulation: {len(h_data)} countries, no repeats"
    mean_lbl_def = "Est. Average Age"
    do_lgnd = False
    # Optional keyworded arguments and derived values
    x_lbl = kwargs.get('x_lbl', x_def)
    y_lbl = kwargs.get('y_lbl', y_def)
    p_ttl = kwargs.get('title', ttl_def)
    s_mean = kwargs.get('mean', None)
    mean_lgnd = kwargs.get('m_lbl', mean_lbl_def)
    s_stdev = kwargs.get('stdev', None)
    s_std_err = kwargs.get('std_err', None)
    m95 = None
    p95 = None
    if s_mean and s_std_err:
        m95 = s_mean - s_std_err
        p95 = s_mean + s_std_err
    pop_mean = kwargs.get('pop_mean', None)
    missed_95 = kwargs.get('ms95', None)
    # now onto plotting
    ax.set_xlabel(x_lbl)
    ax.set_ylabel(y_lbl)
    ax.set_title(p_ttl)
    n, bins, patches = ax.hist(h_data, bins=range(0, 60))
    if s_mean:
        ax.axvline(s_mean, 0, 1, color='r', label=f'{mean_lgnd}: {s_mean:.2f}')
    if m95:
        ax.axvline(m95, 0, 1, color='hotpink', ls='-.', label=f'min 95% conf interval: {m95:.2f}')
    if p95:
        ax.axvline(p95, 0, 1, color='orange', ls='-.', label=f'max 95% conf interval: {p95:.2f}')
    # not actually plotting anything as such, just want value to show in legend area
    if s_stdev:
        ax.axvline(s_stdev, 0, 0, color='w', label=f'std dev: {s_stdev:.2f}')
    # not actually plotting anything as such, just want value to show in legend area
    if s_std_err:
        ax.axvline(s_std_err, 0, 0, color='w', label=f'std err: {s_std_err:.2f}')
    # ax.axvline(w_median, 0, 0, color='k', label=f'World Average: {w_median:.2f}')
    if pop_mean:
        ax.axvline(pop_mean, 0, 1, color='aqua', label=f'Population average: {pop_mean:.2f}')
    # again not actually plotting anything, just want the miss count to show in legend area
    if missed_95:
        ax.axvline(missed_95, 0, 0, color='w', label=f'Population average not in 95% CI: {missed_95}')
    # display legend if appropriate
    if s_mean or s_stdev or s_std_err or pop_mean or missed_95:
        ax.legend()
...
elif n_tst == 8:
    world_avg = w_median
    t8_data_file = f't8_samples_{ex_f_nbr}.txt'
    t8_rpt = 30
    t8_size = 10
    t8_stat = 0
    t8_means = []
    t8_stderrs = []
    t8_seed = 349
    if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
        t8_stat = args.stat - 1
    if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
        t8_rpt = args.rpts
    if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
        t8_size = args.c_cnt
    tmp_seed = t8_seed
    for i in range(t8_rpt):
        tmp_seed += (i * 409)
        p_medians, p_tots, p_mean, p_sd, p_se = all_stat_sng_sample(t8_size, df_seed=tmp_seed)
        t8_means.append(p_mean[t8_stat])
        t8_stderrs.append(p_se[t8_stat])
    # count samples whose 95% confidence interval does not contain the population average
    missed_interval = 0
    for i, mean in enumerate(t8_means):
        if world_avg < mean - t8_stderrs[i] or world_avg > mean + t8_stderrs[i]:
            missed_interval += 1
    with open(P_PATH / t8_data_file, 'a') as fl_dbug:
        fl_dbug.write("{")
        fl_dbug.write(f" 'parameters': [{t8_size}, {t8_rpt}],\n")
        fl_dbug.write(f" 'means': {t8_means},\n")
        fl_dbug.write(f" 'stderrs': {t8_stderrs},\n")
        fl_dbug.write(f" 'missed': {missed_interval}\n")
        fl_dbug.write("},\n")
    # I am likely going to plot the charts for the 4 differing numbers of repeats on a single figure
    # fig, axs = plt.subplots(2, 2, figsize=(15, 9), sharey=True)
    f_wd = 10
    f_ht = 6
    fig, ax = plt.subplots(figsize=(f_wd, f_ht))
    fig.suptitle(f"Estimated World Average Age for {t8_rpt} Samples of Size {t8_size}")
    smpl_mean = sum(t8_means) / t8_rpt
    rnd_mean = f"{smpl_mean:.2f}"
    plt_mean = float(rnd_mean)
    kargs = {
        'title': '',
        'mean': plt_mean,
        'm_lbl': f'Mean of all {t8_rpt} samples',
        'pop_mean': world_avg,
        'ms95': missed_interval
    }
    plot_hist_basic(ax, t8_means, **kargs)
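A side note on that miss count: if each 95% confidence interval really did contain the population average 95% of the time, then for 50 repeats we would expect roughly 0.05 * 50 = 2.5 misses. A quick back-of-the-envelope check of how much that count can bounce around (again, not part of the module):

from scipy import stats

n_rpt = 50      # number of repeated samples
p_miss = 0.05   # idealized miss rate for a 95% confidence interval

print(f"expected misses: {n_rpt * p_miss:.1f}")
for k in range(7):
    print(f"P(exactly {k} misses) = {stats.binom.pmf(k, n_rpt, p_miss):.3f}")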
And, here’s a slightly bigger sample repeated more often.
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 8 -c 20 -r 50
R:\learn\py_play
loading CSV and getting data from population database took 9.94 (1610901935.680137-1610901945.617773)
generating sample stats took 0.27 (1610901945.617773-1610901945.885316)
Test #8: Compare repeating a sample a differing number of times
Random sample of 20 countries, repeated 50 times
               |  Avg  | std dev | norm 95% conf | t 95% conf
               |       |         |      n=50     |    n=20
Simple mean:   | 28.16 |   2.00  | 27.61 - 28.72 | 27.23 - 29.10
Weighted mean: | 28.76 |   3.84  | 27.70 - 29.83 | 26.97 - 30.56
Summed median: | 28.09 |   3.88  | 27.01 - 29.16 | 26.27 - 29.90
World median:  | 28.68
whole sampling process took 10.84
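As a sanity check, the "Simple mean" row can be roughly reproduced from the rounded values shown above, reading the header as n=50 (repeats) for the normal interval and n=20 (countries per sample) for the t interval. The small differences from the table are just rounding of the displayed mean and standard deviation.

from scipy import stats

s_mean, s_sdev = 28.16, 2.00   # rounded values from the output above
n_rpt, n_cty = 50, 20          # repeats and countries per sample

# normal-based interval using the number of repeats
z_half = stats.norm.ppf(0.975) * s_sdev / (n_rpt ** 0.5)
print(f"norm 95% conf: {s_mean - z_half:.2f} - {s_mean + z_half:.2f}")   # ~27.61 - 28.71

# t-based interval using the countries per sample
t_half = stats.t.ppf(0.975, n_cty - 1) * s_sdev / (n_cty ** 0.5)
print(f"t 95% conf:    {s_mean - t_half:.2f} - {s_mean + t_half:.2f}")   # ~27.22 - 29.10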
The results certainly look to be a fair bit more normally distributed than some of our earlier distributions.
That’s It For Today
Sorry! As I have mentioned, life and Advent of Code have seriously mangled my productivity with respect to the blog and this particular exercise in coding and statistics. That was the case for the whole of December and so far this new year. So, unless I get really productive later today (2021.01.17), the above is likely it for this post.
In the next one, I plan to look at plotting the results for samples of a single size with 4 different repetition counts. I will likely plot them all on one figure so that it is easier to compare them visually. I plan to continue using a seed for the random sampling function so that each of the four repetition counts always starts from the same data. I figure that will give us a more realistic idea of what increasing the number of repetitions actually accomplishes, other than reducing the sample's standard error and perhaps the standard deviation.
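Very roughly, I am picturing something along these lines. This is only a sketch: rpt_counts is a set of hypothetical repetition counts and get_sample_means() is a placeholder for whatever the real sampling code ends up being; the subplot layout mirrors the commented-out plt.subplots(2, 2, ...) line in the experiment 8 block.

# sketch only: one subplot per repetition count, same starting seed for all
rpt_counts = [10, 25, 50, 100]   # hypothetical values
fig, axs = plt.subplots(2, 2, figsize=(15, 9), sharey=True)
for ax, n_rpt in zip(axs.flat, rpt_counts):
    means = get_sample_means(n_rpt, t8_size, seed=349)   # placeholder helper
    kargs = {
        'title': f'{n_rpt} repeats',
        'mean': sum(means) / n_rpt,
        'm_lbl': f'Mean of all {n_rpt} samples',
        'pop_mean': world_avg
    }
    plot_hist_basic(ax, means, **kargs)
fig.suptitle(f"Estimated World Average Age, Samples of Size {t8_size}")
plt.show()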
After that, my biggest concern is where I go next. I think I may go back and start over with Data Science Bookcamp. I just don’t know how I can blog about it without risking copyright infringement. We shall see. I may also look at doing my coding for the book in a Jupyter Notebook using VS Code. A new challenge all of its own.
Until next time.
Resources
- T Statistic: Definition, Types and Comparison to Z Score
- T-Score vs. Z-Score: What’s the Difference?
- Comparing means: z and t tests
- Hypothesis Testing and Z-Test vs. T-Test