I really do hope I finally get this finished in some reasonable fashion.
Bit of a Recovery Mission
As I continued coding experiment 8, I modified the code to generate a single figure with four charts (one for each value of the number of repetitions of that particular sample size). I slowly realized that for my blogging purposes, I might still want to generate a single plot showing the histogram for a single set of repeated samples. I also wanted to maintain control of the seed for the random sampling. That was not something any of the other experiments allowed me to do. And, I had already committed some of the changes.
I did think about modifying the code to allow me to tell experiment 8 what I wanted, single chart or 2x2, and have it proceed appropriately. But decided that cluttered the code up too much and would probably require an additional command line parameter. So, I decided I’d move my current experiment 8 code to an experiment 9. Then put my old code back in experiment 8. But, I didn’t have my old code? Or did I?
Git to the rescue. Specifically git show. This command does a great many things (do check out the documentation), but I was only interested in having it show me the experiment 8 code as of my second-last commit, HEAD~1. I have to use HEAD as I am on a branch. Using master will not get me what I want. And, I didn’t want to display it all in a command window, then have to hunt and copy/paste from there. Figured it would be easier if I redirected the output into a file. So:
PS R:\learn\py_play> git show HEAD~1:population/est_avg_age.py > population/play/old_est_avg_age.py
And voilà, I have a file containing all the code for est_avg_age.py prior to my last commit. Perfect. Once I copied things over I deleted the file. As it was not covered by my .gitignore, git kept telling me it was untracked.
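One thing to watch with the redirect: depending on the PowerShell version, > may write the file as UTF-16 rather than UTF-8 (I believe that is the default in Windows PowerShell 5.1), which can confuse editors and diffs. A rough alternative, assuming git is on the PATH, is to have Python capture the bytes and write them out untouched. The helper names below are my own invention, not part of git or of est_avg_age.py.

```python
import subprocess
from pathlib import Path


def git_show_args(rev, repo_path):
    """Build the argument list for `git show <rev>:<path>`."""
    # Normalise to forward slashes, which is how git stores paths
    git_path = repo_path.replace("\\", "/")
    return ["git", "show", f"{rev}:{git_path}"]


def save_old_version(rev, repo_path, out_file):
    """Write the file as it existed at `rev` to `out_file`, byte for byte."""
    result = subprocess.run(git_show_args(rev, repo_path),
                            capture_output=True, check=True)
    out = Path(out_file)
    out.write_bytes(result.stdout)
    return out


# Usage, run from the repository root (paths taken from the command above):
# save_old_version("HEAD~1", "population/est_avg_age.py",
#                  "population/play/old_est_avg_age.py")
```

Writing the raw bytes sidesteps any re-encoding by the shell, so the recovered file should match what was committed.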
I also wanted to be able to display the 95% confidence interval for all the repeated samples. So, I added another kwarg and modified the code for plot_hist_basic() and experiment 8 accordingly. May or may not add that to the charts in experiment 9.
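One wrinkle with optional keyword arguments like these: a test such as `if missed_95:` treats a legitimate value of 0 the same as "not supplied", so a run where no sample missed its interval would lose that legend entry. A small sketch of the difference, using a made-up helper (legend_labels) rather than the real plot_hist_basic():

```python
def legend_labels(**kwargs):
    """Collect legend labels for whichever optional values were supplied."""
    labels = []
    missed_95 = kwargs.get('ms95', None)
    r_serr = kwargs.get('se', None)
    # `is not None` keeps a genuine 0, where a bare truth test would drop it
    if r_serr is not None:
        labels.append(f'Std err for all samples: {r_serr:.2f}')
    if missed_95 is not None:
        labels.append(f'Pop. average not in single sample 95% CI: {missed_95}')
    return labels


print(legend_labels(ms95=0, se=0.62))
print(legend_labels())
```

Whether a zero count deserves a legend entry is a judgment call, but testing against None at least makes the choice explicit.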
And, should you be interested, here’s my git diff for the above changes.
PS R:\learn\py_play> git diff head^ head
diff --git a/population/est_avg_age.py b/population/est_avg_age.py
index c6258e3..3643f7f 100644
--- a/population/est_avg_age.py
+++ b/population/est_avg_age.py
@@ -240,6 +240,7 @@ def plot_hist_basic(ax, h_data, **kwargs):
ttl_def = f"Simulation: {len(h_data)} countries, no repeats"
mean_lbl_def = "Est. Average Age"
do_lgnd = False
+ rpts = len(h_data)
# Optional keyworded arguments and derived values
x_lbl = kwargs.get('x_lbl', x_def)
@@ -256,6 +257,7 @@ def plot_hist_basic(ax, h_data, **kwargs):
p95 = s_mean + s_std_err
pop_mean = kwargs.get('pop_mean', None)
missed_95 = kwargs.get('ms95', None)
+ r_serr = kwargs.get('se', None)
# now onto plotting
ax.set_xlabel(x_lbl)
@@ -277,12 +279,20 @@ def plot_hist_basic(ax, h_data, **kwargs):
if s_std_err:
ax.axvline(s_std_err, 0, 0, color='w', label=f'std err: {s_std_err:.2f}')
# ax.axvline(w_median, 0, 0, color='k', label=f'World Average: {w_median:.2f}')
- # display legend if appropriate
+ if r_serr:
+ if s_mean:
+ lw95 = s_mean - r_serr
+ ax.axvline(lw95, 0, 1, color='hotpink', ls='-.', label=f'Std err for all {rpts} samples: {r_serr:.2f}')
+ hg95 = s_mean + r_serr
+ ax.axvline(hg95, 0, 1, color='hotpink', ls='-.')
+ else:
+ ax.avline(r_serr,0, 1, color='w', label=f'Std err for all {rpts} samples: {r_serr:.2f}')
if pop_mean:
ax.axvline(pop_mean, 0, 1, color='aqua', label=f'Population average: {pop_mean:.2f}')
if missed_95:
- ax.axvline(missed_95, 0, 1, color='w', label=f'Population average not in 95% CI: {missed_95:.2f}')
- if s_mean or s_stdev or s_std_err or pop_mean:
+ ax.axvline(missed_95, 0, 1, color='w', label=f'Pop. average not in single sample 95% CI: {missed_95}')
+ # display legend if appropriate
+ if s_mean or s_stdev or s_std_err or pop_mean or r_serr:
ax.legend()
@@ -318,7 +328,8 @@ if __name__ == '__main__':
"Generate and plot a single sample of the specified size",
"Generate and compare samples of varying sizes",
"Compare repeated samples of a single size",
- "Compare repeating a single sample size a differing number of times",
+ "Generate and plot means for samples of size -c repeated -r times"
+ "Compare repeating a single sample (size -c) a differing number of times",
# following should always be the last test
"Plot Single Statistic, One Histogram"
]
@@ -714,6 +725,87 @@ if __name__ == '__main__':
autolabel(rects1, "left")
elif n_tst == 8:
+ world_avg = w_median
+ t8_data_file = f't8_samples_{ex_f_nbr}.txt'
+ t8_rpt = 30
+ t8_size = 10
+ t8_stat = 0
+ t8_means = []
+ t8_stderrs = []
+
+ t8_seed = 349
+
+ if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
+ t8_stat = args.stat - 1
+ if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
+ t8_rpt = args.rpts
+ if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
+ t8_size = args.c_cnt
+
+ t8_means = []
+ t8_stderrs = []
+ tmp_seed = t8_seed
+ for i in range(t8_rpt):
+ tmp_seed += (i * 409)
+ p_medians, p_tots, p_mean, p_sd, p_se = all_stat_sng_sample(t8_size, df_seed=tmp_seed)
+ t8_means.append(p_mean[t8_stat])
+ t8_stderrs.append(p_se[t8_stat])
+
+ missed_interval = 0
+ for i, mean in enumerate(t8_means):
+ if world_avg < mean - t8_stderrs[i] or world_avg > mean + t8_stderrs[i]:
+ missed_interval += 1
+
+ with open(P_PATH / t8_data_file, 'a') as fl_dbug:
+ fl_dbug.write("{")
+ fl_dbug.write(f" 'parameters': [{t8_size}, {t8_rpt}]")
+ fl_dbug.write(f" 'means': {t8_means},\n")
+ fl_dbug.write(f" 'stderrs': {t8_stderrs},\n")
+ fl_dbug.write(f" 'missed': {missed_interval})\n")
+ fl_dbug.write("},\n")
+
+ #fig, axs = plt.subplots(2, 2, figsize=(15,9), sharey=True)
+ f_wd = 10
+ f_ht = 6
+ fig, ax = plt.subplots(figsize=(f_wd,f_ht))
+ # Defaults
+ # x_def = 'Age'
+ # y_def = 'Count'
+ # ttl_def = f"Simulation: {len(h_data)} countries, no repeats"
+ # mean_lbl_def = "Est. Average Age"
+ # do_lgnd = False
+
+ # # Optional keyworded arguments and derived values
+ # x_lbl = kwargs.get('x_lbl', x_def)
+ # y_lbl = kwargs.get('y_lbl', y_def)
+ # p_ttl = kwargs.get('title', ttl_def)
+ # s_mean = kwargs.get('mean', None)
+ # mean_lgnd = kwargs.get('m_lbl', mean_lbl_def)
+ # s_stdev = kwargs.get('stdev', None)
+ # s_std_err = kwargs.get('std_err', None)
+ # pop_mean = kwargs.get('pop_mean', None)
+ # missed_95 = kwargs.get('ms95', None)
+ # r_serr = kwargs.get('se', None)
+
+ fig.suptitle(f"Estimated World Average Age for {t8_rpt} Samples of Size {t8_size}")
+ smpl_mean = sum(t8_means) / t8_rpt
+ rnd_mean = f"{smpl_mean:.2f}"
+ plt_mean = float(rnd_mean)
+ t8_mean = statistics.mean(t8_means)
+ t8_stdev = statistics.stdev(t8_means)
+ t8_se = t_conf_int(t8_size, t8_stdev, p_interval=95, tails=2)
+
+ kargs = {
+ 'title': '',
+ 'mean': plt_mean,
+ 'm_lbl': f'Mean of all {t8_rpt} samples',
+ 'pop_mean': world_avg,
+ 'ms95': missed_interval,
+ 'se': t8_se
+ }
+ plot_hist_basic(ax, t8_means, **kargs)
+
+ elif n_tst == 9:
world_avg = w_median
t8_data_file = f't8_samples_{ex_f_nbr}.txt'
# for the base 4 example case, the nbr of rpts will be:
Back to the Code for Experiment 9
I have got the code together to plot 4 charts (2 x 2 on one figure) each showing a differing number of repetitions for a sample of a given size. The sample means are being plotted. My defaults are a sample size of 10 and with repetitions of 30, 50, 70 and 100. And, being rather lazy, I did not change my variable names to reflect the change in experiment number. So the experiment 9 code has numerous variables labelled t8_*.
I have also changed my code to, in each execution, select a somewhat different seed for the random sampling based on the current time.
t8_seed = 13 * (int(time.time()) % int(use_yr))
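Spelled out as a function, the seed calculation looks something like the sketch below. The variable use_yr is set elsewhere in the module and isn’t shown in this post, so the 2020 here is purely a placeholder, and the function name is mine.

```python
import time


def time_seed(use_yr, now=None):
    """Seed derived from the clock: 13 * (whole seconds modulo use_yr)."""
    if now is None:
        now = time.time()
    return 13 * (int(now) % int(use_yr))


# With the placeholder use_yr of 2020 the seed is always a multiple of 13
# somewhere between 0 and 13 * 2019.
print(time_seed(2020))
# A fixed clock value makes the arithmetic easy to follow:
# int(4045.7) % 2020 == 5, and 13 * 5 == 65
print(time_seed(2020, now=4045.7))
```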
That said, I have command line parameters allowing me to change the sample size (-c), the number of repetitions (-r) and the default seed value (-m). For the last of these I reused a parameter from a different experiment (#7) for this one (#9).
And, of course we are back to the multi-axes form for initializing our matplotlib figure:
fig, axs = plt.subplots(2, 2, figsize=(15,9), sharey=True, sharex=True)
And, since I will be using all four positions, I am sharing the x and y axis labels. And, I am using our previously discussed code to get the appropriate axis for each chart.
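For what it’s worth, the row and column bookkeeping can also be done with divmod(). A small sketch, separate from the experiment code, that checks it gives the same positions as the counter approach used in the experiment 9 listing for a 2 x 2 grid (both function names are mine):

```python
def grid_pos(idx, n_cols=2):
    """Return (row, column) for the idx-th chart, filling row by row."""
    return divmod(idx, n_cols)


def grid_pos_counter(n_charts=4):
    """The counter-based approach: bump the row on every even index after 0."""
    positions = []
    r_nbr = 0
    for r in range(n_charts):
        if r > 0 and r % 2 == 0:
            r_nbr += 1
        positions.append((r_nbr, r % 2))
    return positions


print([grid_pos(r) for r in range(4)])
print(grid_pos_counter())
```

Both approaches visit the top row left to right, then the bottom row.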
I am determining the various repetition values based on multipliers and the current value of the t8_rpt variable: either 30, the default, or whatever was set via the command line parameter. The multiplier array consists of the following multipliers (chosen to get 30, 50, 70, 100 for the default value of 30):
t8_r_mult = [1, 1.66, 2.33, 3.33]
And, the calculation of the actual values looks like:
# in my variable definitions at the top of the experiment block I have the following
t8_base = 10
if t8_rpt % 5 == 0:
    t8_base = 5
...
# further down I sort out the 4 repetition amounts to be used
t8_rpts = [t8_rpt]
for i in range(1, 4):
    t8_rpts.append(int(round(t8_rpt * t8_r_mult[i] / t8_base) * t8_base))
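Bit of fooling around on my part. Not of any particular value to the code or experiment. The intent is that if the initial repeat value is a multiple of 5, all the generated values will be multiples of 5; otherwise they will be multiples of 10. Note, though, that the check sits in the variable definitions, where t8_rpt still holds the default of 30, so as written t8_base ends up as 5 whatever -r value is supplied.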
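Pulling those two fragments together into one function makes it easy to see what the rounding does. This is my own wrapper (rep_counts and its base argument are not in the module), and it picks the base from the value passed in, which is the stated intent.

```python
T8_R_MULT = [1, 1.66, 2.33, 3.33]


def rep_counts(first_rpt, base=None):
    """Return the four repetition counts, rounded to a multiple of `base`."""
    if base is None:
        base = 5 if first_rpt % 5 == 0 else 10
    counts = [first_rpt]
    for mult in T8_R_MULT[1:]:
        counts.append(int(round(first_rpt * mult / base) * base))
    return counts


print(rep_counts(30))           # [30, 50, 70, 100]
print(rep_counts(50))           # [50, 85, 115, 165]
print(rep_counts(50, base=10))  # [50, 80, 120, 170]
```

The second line matches the repetition counts that show up when the experiment is run with -r 50.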
As mentioned above there were also some changes to the histogram plotting function so that I could add extra details to the charts should I wish to do so.
Here’s the output for the default situation:
(base-3.8) PS R:\learn\py_play> python R:\learn\py_play\population\est_avg_age.py -e 9
loading CSV and getting data from population database took 9.77 (1611083478.398643-1611083488.165751)
generating sample stats took 0.22 (1611083488.165751-1611083488.382161)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 13039
whole sampling process took 11.84
You will note that the mean age for each of the 4 sample distributions is very close to the actual world (population) average. This will of course vary from execution to execution. But will always produce this result if the initial seed variable is set to 13039.
Now, if I add the standard error for each set of repeated sample means, it looks like the following. You will note I used the command line parameter to specify the same seed the first chart used.
(base-3.8) PS R:\learn\py_play> python R:\learn\py_play\population\est_avg_age.py -e 9 -m 13039
loading CSV and getting data from population database took 9.95 (1611084115.761430-1611084125.708925)
generating sample stats took 0.16 (1611084125.708925-1611084125.864962)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 13039
whole sampling process took 12.17
And here’s my code for experiment #9. Note, I once again save the sampling data to a file so that I have it available should I wish to do other things with it for inclusion in this or another post. You likely don’t need to do that.
elif n_tst == 9:
    world_avg = w_median
    t8_data_file = f't9_samples_{ex_f_nbr}.txt'
    # for the base 4 example case, the nbr of rpts will be:
    t8_rpt = 30
    t8_size = 10
    t8_stat = 0
    t8_means = []
    t8_stderrs = []
    t8_base = 10
    if t8_rpt % 5 == 0:
        t8_base = 5
    t8_r_mult = [1, 1.66, 2.33, 3.33]
    t8_seed = 13 * (int(time.time()) % int(use_yr))
    if args.stat and (args.stat >= 1 and args.stat <= max_smpl):
        t8_stat = args.stat - 1
    if args.rpts and (args.rpts >= 1 and args.rpts <= max_rpt):
        t8_rpt = args.rpts
    if args.c_cnt and (args.c_cnt >= 1 and args.c_cnt <= max_size):
        t8_size = args.c_cnt
    # going to use this cmd line arg (-m) with exp 9 to allow me to specify initial seed value
    if args.e7_multiples:
        t8_seed = args.e7_multiples
    print(f"Initial seed value: {t8_seed}")
    fig, axs = plt.subplots(2, 2, figsize=(15,9), sharey=True, sharex=True)
    fig.suptitle(f"Estimated World Average Age for Repeated Samples of Size {t8_size}")
    t8_rpts = [t8_rpt]
    for i in range(1, 4):
        t8_rpts.append(int(round(t8_rpt * t8_r_mult[i] / t8_base) * t8_base))
    r_nbr = 0
    for r in range(4):
        # sort axes for plot
        if r > 0 and r % 2 == 0:
            r_nbr += 1
        c_nbr = r % 2
        t8_r_tmp = t8_rpts[r]
        t8_means = []
        t8_stderrs = []
        tmp_seed = t8_seed
        for i in range(t8_r_tmp):
            tmp_seed += (i * 409)
            p_medians, p_tots, p_mean, p_sd, p_se = all_stat_sng_sample(t8_size, df_seed=tmp_seed)
            t8_means.append(p_mean[t8_stat])
            t8_stderrs.append(p_se[t8_stat])
        missed_interval = 0
        for i, mean in enumerate(t8_means):
            if world_avg < mean - t8_stderrs[i] or world_avg > mean + t8_stderrs[i]:
                missed_interval += 1
        smpl_mean = sum(t8_means) / t8_r_tmp
        rnd_mean = f"{smpl_mean:.2f}"
        plt_mean = float(rnd_mean)
        t8_stdev = statistics.stdev(t8_means)
        t8_se = t_conf_int(t8_size, t8_stdev, p_interval=95, tails=2)
        with open(P_PATH / t8_data_file, 'a') as fl_dbug:
            fl_dbug.write("{")
            fl_dbug.write(f" 'parameters': [{t8_size}, {t8_r_tmp}, {t8_seed}],\n")
            fl_dbug.write(f" 'means': {t8_means},\n")
            fl_dbug.write(f" 'stderrs': {t8_stderrs},\n")
            fl_dbug.write(f" 'missed': {missed_interval}, 'sample mean': {smpl_mean}, 'sample std dev': {t8_stdev}, 'sample std err': {t8_se}\n")
            fl_dbug.write("},\n")
        kargs = {
            'title': f'Sample Repeated {t8_r_tmp} Times',
            'mean': plt_mean,
            # label with this chart's repetition count rather than the initial one
            'm_lbl': f'Mean of all {t8_r_tmp} samples',
            'pop_mean': world_avg,
            'ms95': missed_interval,
            'se': t8_se
        }
        plot_hist_basic(axs[r_nbr, c_nbr], t8_means, **kargs)
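The hand-built text written to the data file is close to a Python dict but not quite parseable as one. If I wanted to read the data back later, writing one JSON object per line would probably be simpler. A sketch, with made-up helper names (append_run, load_runs) and placeholder numbers rather than real sampling results:

```python
import json
import tempfile
from pathlib import Path


def append_run(path, record):
    """Append one run as a single line of JSON."""
    with open(path, 'a', encoding='utf-8') as fl:
        fl.write(json.dumps(record) + "\n")


def load_runs(path):
    """Read every run back as a list of dicts."""
    with open(path, encoding='utf-8') as fl:
        return [json.loads(line) for line in fl if line.strip()]


# Demo with placeholder numbers, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = Path(tmp_dir) / 't9_samples_demo.txt'
    append_run(data_file, {'parameters': [10, 30, 13039],
                           'means': [28.1, 27.9],
                           'stderrs': [1.2, 1.4],
                           'missed': 0})
    runs = load_runs(data_file)

print(runs[0]['parameters'])
```

Appending one line per run keeps the 'a' mode behaviour of the original while making each record independently readable.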
A Look at a Few Different Combinations
Varying Random Seed for Each Sample and Repetition Value
(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 9 -c 20
loading CSV and getting data from population database took 9.80 (1611161888.603303-1611161898.403286)
generating sample stats took 0.41 (1611161898.403286-1611161898.810870)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 12792
whole sampling process took 13.16
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 9 -c 30
loading CSV and getting data from population database took 9.14 (1611162051.543878-1611162060.681555)
generating sample stats took 0.64 (1611162060.681555-1611162061.323608)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 14911
whole sampling process took 12.53
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 9 -c 50
loading CSV and getting data from population database took 9.52 (1611162180.255771-1611162189.774642)
generating sample stats took 0.96 (1611162189.774642-1611162190.731630)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 16588
whole sampling process took 14.30
And in the above (50 x 30), the world (population) average never once fell within the 95% confidence interval for the result of all the repetitions combined. Close, but… Let’s try again using a larger starting value for the number of repetitions and a different random seed.
(base-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\est_avg_age.py -e 9 -c 50 -r 50
loading CSV and getting data from population database took 9.37 (1611162961.967492-1611162971.339756)
generating sample stats took 0.50 (1611162971.339756-1611162971.842826)
Test #9: Compare repeating a single sample (size -c) a differing number of times
Initial seed value: 598
whole sampling process took 14.22
And, no real change. Once again, the world (population) average never fell within the 95% confidence interval for any of the four repeated samplings.
As I appreciate that the figures are a “wee bit small”, here’s a table of the data for the above four experiment executions. Note: ✓ indicates the world average was within the 95% confidence interval for all the repetitions, or that the number of samples that did not contain the world average within their 95% confidence interval was less than 5%. An 𝖷 means it/they did not.
Sample Size | Number of Repetitions | Mean (all samples) | Std Dev (all samples) | Std Err (all samples) | # Samples That Missed 95% CI |
---|---|---|---|---|---|
20 | 30 | 28.16 | 1.33 | 0.62 ✓ | 1 ✓ |
20 | 50 | 28.25 | 1.56 | 0.73 ✓ | 2 ✓ |
20 | 70 | 28.23 | 1.70 | 0.79 ✓ | 3 ✓ |
20 | 100 | 28.15 | 1.82 | 0.85 ✓ | 4 ✓ |
30 | 30 | 28.52 | 1.33 | 0.50 ✓ | 1 ✓ |
30 | 50 | 28.43 | 1.36 | 0.51 ✓ | 2 ✓ |
30 | 70 | 28.43 | 1.37 | 0.51 ✓ | 3 ✓ |
30 | 100 | 28.51 | 1.39 | 0.52 ✓ | 4 ✓ |
50 | 30 | 27.95 | 1.44 | 0.41 𝖷 | 4 𝖷 |
50 | 50 | 28.23 | 1.39 | 0.40 𝖷 | 4 𝖷 |
50 | 70 | 28.29 | 1.27 | 0.36 𝖷 | 4 𝖷 |
50 | 100 | 28.18 | 1.24 | 0.35 𝖷 | 7 𝖷 |
50 | 50 | 28.35 | 0.90 | 0.25 𝖷 | 1 ✓ |
50 | 85 | 28.34 | 0.99 | 0.28 𝖷 | 2 ✓ |
50 | 115 | 28.33 | 0.95 | 0.27 𝖷 | 2 ✓ |
50 | 165 | 28.25 | 0.95 | 0.27 𝖷 | 3 ✓ |
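To make the two tick-mark rules used in the table concrete, here is roughly how I’d express them in code. The function name is mine, and the numbers in the example calls are placeholders rather than values from any run.

```python
def ci_checks(grand_mean, half_width, pop_mean, missed, rpts):
    """Return (pop mean inside the combined 95% CI?, fewer than 5% of samples missed?)."""
    in_combined_ci = grand_mean - half_width <= pop_mean <= grand_mean + half_width
    few_missed = (missed / rpts) < 0.05
    return in_combined_ci, few_missed


# Placeholder values only
print(ci_checks(28.0, 0.5, 28.4, 1, 30))  # (True, True)
print(ci_checks(28.0, 0.3, 28.4, 4, 30))  # (False, False)
```

The first element corresponds to the mark in the Std Err column, the second to the mark in the last column.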
Same Random Seed
Now I am going to go back and run those last three using the same initial random seed, 12792, as the first one (sample size 20 with 30 initial repetitions).
As mentioned above, the figures are a “wee bit small”, so here’s a table of the data for the above three and the first with the same random seed.
Sample Size | Number of Repetitions | Mean (all samples) | Std Dev (all samples) | Std Err (all samples) | # Samples That Missed 95% CI |
---|---|---|---|---|---|
20 | 30 | 28.16 | 1.33 | 0.62 ✓ | 1 ✓ |
20 | 50 | 28.25 | 1.56 | 0.73 ✓ | 2 ✓ |
20 | 70 | 28.23 | 1.70 | 0.79 ✓ | 3 ✓ |
20 | 100 | 28.15 | 1.82 | 0.85 ✓ | 4 ✓ |
30 | 30 | 28.26 | 1.37 | 0.51 ✓ | 2 𝖷 |
30 | 50 | 28.18 | 1.51 | 0.56 ✓ | 4 𝖷 |
30 | 70 | 28.22 | 1.51 | 0.56 ✓ | 6 𝖷 |
30 | 100 | 28.21 | 1.51 | 0.56 ✓ | 8 𝖷 |
50 | 30 | 28.30 | 1.06 | 0.30 𝖷 | 1 ✓ |
50 | 50 | 28.25 | 1.04 | 0.29 𝖷 | 2 ✓ |
50 | 70 | 28.28 | 1.01 | 0.29 𝖷 | 3 ✓ |
50 | 100 | 28.29 | 0.97 | 0.28 𝖷 | 3 ✓ |
50 | 50 | 28.25 | 1.04 | 0.29 𝖷 | 2 ✓ |
50 | 85 | 28.23 | 0.99 | 0.28 𝖷 | 3 ✓ |
50 | 115 | 28.29 | 1.01 | 0.29 𝖷 | 4 ✓ |
50 | 165 | 28.30 | 1.01 | 0.29 𝖷 | 4 ✓ |
Sample Size of 50?
In none of our above attempts with a sample size of 50 did any of our sixteen 95% confidence intervals contain the world average. Close, but no cigar. Though the number of decimal places may be a factor. So, I am going to execute the experiment numerous times using a sample size of 50 and a starting repetition value of 50, allowing the code to select the starting random seed each time. The results look like this. (Random seed in brackets in sample size column in case you wish to try repeating the sampling and charting.)
Sample Size (Seed) | Number of Repetitions | Mean (all samples) | Std Dev (all samples) | Std Err (all samples) | # Samples That Missed 95% CI |
---|---|---|---|---|---|
50 (17550) | 50 | 28.37 | 1.12 | 0.32 ✓ | 2 ✓ |
50 (17550) | 85 | 28.40 | 1.03 | 0.29 ✓ | 3 ✓ |
50 (17550) | 115 | 28.41 | 1.02 | 0.29 ✓ | 4 ✓ |
50 (17550) | 165 | 28.40 | 1.00 | 0.28 ✓ | 6 ✓ |
50 (19487) | 50 | 28.38 | 1.04 | 0.30 ✓ | 1 ✓ |
50 (19487) | 85 | 28.41 | 1.06 | 0.30 ✓ | 4 ✓ |
50 (19487) | 115 | 28.43 | 1.02 | 0.29 ✓ | 4 ✓ |
50 (19487) | 165 | 28.36 | 1.06 | 0.30 𝖷 | 6 ✓ |
50 (20176) | 50 | 28.25 | 0.87 | 0.25 𝖷 | 1 ✓ |
50 (20176) | 85 | 28.29 | 0.93 | 0.26 𝖷 | 3 ✓ |
50 (20176) | 115 | 28.29 | 0.99 | 0.28 𝖷 | 6 𝖷 |
50 (20176) | 165 | 28.34 | 0.95 | 0.27 𝖷 | 6 ✓ |
50 (20826) | 50 | 28.13 | 1.03 | 0.29 𝖷 | 1 ✓ |
50 (20826) | 85 | 28.21 | 0.98 | 0.28 𝖷 | 2 ✓ |
50 (20826) | 115 | 28.30 | 1.00 | 0.28 𝖷 | 2 ✓ |
50 (20826) | 165 | 28.32 | 1.04 | 0.29 𝖷 | 4 ✓ |
50 (21632) | 50 | 28.37 | 1.02 | 0.29 𝖷 | 1 ✓ |
50 (21632) | 85 | 28.49 | 1.07 | 0.30 ✓ | 2 ✓ |
50 (21632) | 115 | 28.47 | 1.12 | 0.32 ✓ | 4 ✓ |
50 (21632) | 165 | 28.34 | 1.10 | 0.31 𝖷 | 6 ✓ |
For comparison, I then did the same thing with a sample of size 30 repeated with a starting value of 50. For five differing random seeds.
Sample Size (Seed) | Number of Repetitions | Mean (all samples) | Std Dev (all samples) | Std Err (all samples) | # Samples That Missed 95% CI |
---|---|---|---|---|---|
30 (22711) | 50 | 28.21 | 1.51 | 0.56 ✓ | 3 𝖷 |
30 (22711) | 85 | 28.10 | 1.45 | 0.54 𝖷 | 6 𝖷 |
30 (22711) | 115 | 28.17 | 1.42 | 0.53 ✓ | 6 𝖷 |
30 (22711) | 165 | 28.29 | 1.45 | 0.54 ✓ | 8 ✓ |
30 (23686) | 50 | 28.20 | 1.44 | 0.53 ✓ | 4 𝖷 |
30 (23686) | 85 | 28.25 | 1.24 | 0.46 ✓ | 4 ✓ |
30 (23686) | 115 | 28.26 | 1.35 | 0.50 ✓ | 6 𝖷 |
30 (23686) | 165 | 28.34 | 1.40 | 0.52 ✓ | 8 ✓ |
30 (24310) | 50 | 28.21 | 1.33 | 0.50 ✓ | 0 ✓ |
30 (24310) | 85 | 28.36 | 1.39 | 0.52 ✓ | 2 ✓ |
30 (24310) | 115 | 28.31 | 1.35 | 0.50 ✓ | 3 ✓ |
30 (24310) | 165 | 28.37 | 1.41 | 0.53 ✓ | 5 ✓ |
30 (24908) | 50 | 28.26 | 1.71 | 0.64 ✓ | 6 𝖷 |
30 (24908) | 85 | 28.22 | 1.62 | 0.61 ✓ | 8 𝖷 |
30 (24908) | 115 | 28.26 | 1.60 | 0.60 ✓ | 10 𝖷 |
30 (24908) | 165 | 28.27 | 1.54 | 0.57 ✓ | 13 𝖷 |
30 (26013) | 50 | 28.32 | 1.36 | 0.51 ✓ | 2 ✓ |
30 (26013) | 85 | 28.33 | 1.40 | 0.52 ✓ | 3 ✓ |
30 (26013) | 115 | 28.24 | 1.45 | 0.54 ✓ | 5 ✓ |
30 (26013) | 165 | 28.33 | 1.44 | 0.54 ✓ | 7 ✓ |
Conclusions?
Well, sampling can get us close to the actual population average. With some caveats.
To me it looks like using a sample size of 50 countries is questionable. It is as though we are oversampling and not quite getting the correct answer. Though it is still very close to the one we were after. But it seemed smaller sample sizes got there more frequently than the larger size. And, I don’t think it is related to the number of repetitions as those are the same for many of the experiments. So, the resulting z-value would be the same for all of them. The only difference would be the variability of the data in each execution, as measured by the standard deviation.
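On that last point: going by the call t_conf_int(t8_size, t8_stdev, p_interval=95, tails=2) in the experiment code, the value in the Std Err column seems to be driven by the sample size and the standard deviation, not by the number of repetitions. The body of t_conf_int() isn’t shown in this post, so the sketch below simply assumes it computes t * s / sqrt(n) with n - 1 degrees of freedom. The critical values are standard two-tailed 95% figures from a t-table, hard-coded here so nothing beyond the standard library is needed, and half_width_95 is my own name.

```python
import math

# Two-tailed 95% critical values of Student's t, keyed by degrees of freedom
T_CRIT_95 = {19: 2.093, 29: 2.045, 49: 2.010}


def half_width_95(n, stdev):
    """Assumed form of the interval half-width: t * s / sqrt(n), df = n - 1."""
    return T_CRIT_95[n - 1] * stdev / math.sqrt(n)


# (sample size, std dev) pairs taken from the first row of each
# sample-size group in the first table
for n, sd in [(20, 1.33), (30, 1.33), (50, 1.44)]:
    print(n, sd, round(half_width_95(n, sd), 2))
```

Those come out at 0.62, 0.50 and 0.41, which line up with the corresponding Std Err values in the first table, so the assumption looks plausible. If it holds, taking more repetitions would not be expected to narrow the interval, while a larger sample size would, which may be part of why the size 50 runs look "worse".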
And, there didn’t really seem to be a correlation between the number of individual samples that didn’t contain the population average in their 95% confidence interval and whether or not the population mean was in the 95% confidence interval for the mean of the repeated samples.
Also, for each set where we used the same starting random seed, if the initial value for the smallest number of repetitions was “bad”, taking additional samples didn’t appear to help. Not sure why not.
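One possible explanation, going by the experiment 9 listing: each of the four charts restarts from the same t8_seed and then steps the seed in the same way, so the larger runs begin with exactly the same seeds as the smaller ones. Assuming all_stat_sng_sample() returns the same sample for the same df_seed (which is the point of passing a seed, though its code isn’t shown here), the 50-repetition chart contains the 30-repetition chart’s samples plus 20 new ones. A quick sketch of just the seed stepping, with seed_sequence being my own helper name:

```python
def seed_sequence(start_seed, n_rpts, step=409):
    """Reproduce the per-repetition seeds: tmp_seed += i * step on each pass."""
    seeds = []
    tmp_seed = start_seed
    for i in range(n_rpts):
        tmp_seed += i * step
        seeds.append(tmp_seed)
    return seeds


seeds_30 = seed_sequence(12792, 30)
seeds_50 = seed_sequence(12792, 50)
print(seeds_30[:4])               # [12792, 13201, 14019, 15246]
print(seeds_50[:30] == seeds_30)  # True
```

So a "bad" first 30 samples would carry over into every larger run with the same starting seed, and the extra samples can only dilute them rather than replace them.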
I believe there is something not quite right with the data we are using. As the population mean was always near the upper boundary for the 95% confidence interval. I would have expected just a bit more variation. And, we did notice before that the sample means always seemed to be lower than the actual population mean. I expect this is likely attributable to my approach to selecting the countries from the population database. Or, possibly, to data missing from the database.
I also believe I really need to try taking a statistics course.
That’s It For This One
Lengthy post. Though lots more charts and data than content — such is life. And, finally done with this topic! At least for the time being.
I have no real idea where to go next. Some things rolling around in my head, but…
That said, I think I am going to waste some time and try to animate a chart. Saw something in an article, and thought I’d give it a try. Likely waste a bunch of time; but, an animation often provides information to those viewing it. Not sure what to animate, but I will start with one of our repeated sampling histograms. I would look at showing each sample being added to the chart in the appropriate column.
As I said, likely a big waste of time. But, it might prove fun.
I am thinking about perhaps also trying to animate covid-19 data after sorting the histograms.
Oh, yes. I will also merge my avg-age branch into the main branch before starting on anything else. Perhaps create one for the animation coding. Though not sure that is really necessary as I will likely start a completely new code module.
Until next time.
Resources
- git-show - Show various types of objects
- How can I view an old version of a file with Git?