Let’s Finish This
Last time, we called it quits after getting the two charts done. And, saved to a blog post. Though we did rework things a fair bit. Not to mention some bug fixes and enhancements to previous work. All we really have left is the listing/table of all the descriptive statistics based on a given sample data distribution (e.g. bin labels and bin values) for a given country and year. Should be a fairly quick post.
Again, I am likely going to do more than you may need to do. I want to put everything that is generated for a given country into a post for this blog. So, I have to save things to particular location for each country. Create a post file (index.md as I am using markdown with Hugo to generate the posts). Save the plots to image files in a suitable location. Append the image links to the post file. Add the table of the complete list of descriptive stats to the post. Etc. Most of that is done, except for the tables, one for each country and one for all the countries combined.
Get All the Descriptive Statistics
I am going to start by creating the function get_all_dstats() in chart/descriptive_stats.py. It will take a list of the base bin labels, and the binned population data for the specified country and year. And, it will return a dictionary of the descriptive statistics values, keyed on a statistic name. I didn’t have it format the data in any particular way, as that would be mixing concerns unnecessarily. I will worry about formatting the table within stats_rand.test.py — using functions for each particular case.
get_all_dstats()
This is really just a rework of one of my/our test cases. In my case, test #6. When you’re ready, here’s my version. I did think about getting rid of all the extra variables and just putting the function call in the assignment statements. But…
def get_all_dstats(b_lbls, b_data):
desc_stats = {}
grp_lows = get_grp_low_bdrys(b_lbls)
grp_mids = get_grp_middle(b_lbls, b_width=5)
grp_highs = get_grp_up_bdrys(b_lbls)
s_mean = get_wt_mean(grp_mids, b_data)
desc_stats['sample mean'] = s_mean
mean_low = get_wt_mean(grp_lows, b_data)
mean_high = get_wt_mean(grp_highs, b_data)
desc_stats['est. mean low'] = mean_low
desc_stats['est. mean high'] = mean_high
s_var = get_binned_var(grp_mids, b_data, s_mean)
s_sdev = get_binned_sd(grp_mids, b_data, s_mean)
desc_stats['sample variance'] = s_var
desc_stats['sample standard deviation'] = s_sdev
sc_var = var_sheppard(s_var, b_wd=5)
sc_sdev = sd_sheppard(s_var, b_wd=5)
desc_stats['sheppard corrected variance'] = sc_var
desc_stats['sheppard corrected standard deviation'] = sc_sdev
s_q2 = get_wt_quartile(grp_lows, b_data)
s_q1 = get_wt_quartile(grp_lows, b_data, qnbr=1)
s_q3 = get_wt_quartile(grp_lows, b_data, qnbr=3)
desc_stats['1st quartile'] = s_q1
desc_stats['2nd quartile (median)'] = s_q2
desc_stats['3rd quartile'] = s_q3
iqr = s_q3 - s_q1
low_out = s_q1 - (1.5 * iqr)
up_out = s_q3 + (1.5 * iqr)
desc_stats['IQR'] = iqr
desc_stats['low outlier boundary'] = low_out
desc_stats['upper outlier boundary'] = up_out
s_ub_var = get_binned_var(grp_mids, b_data, s_mean, is_pop=True)
moment_3 = get_binned_3rd_moment(grp_mids, b_data, s_mean)
tot_n = sum(b_data)
coeff_skew = calc_coeff_skew(moment_3, s_ub_var, tot_n)
moment_4 = get_binned_4th_moment(grp_mids, b_data, s_mean)
desc_stats['moment coefficient of skewness'] = coeff_skew
desc_stats["Pearson's 2nd coefficient"] = ((3*(s_mean - s_q2))/s_sdev)
desc_stats["Bowley's measure of skewness"] = ((s_q1 + s_q3 - (2*s_q2)) / iqr)
kurtosis = calc_kurtosis(moment_4, s_ub_var)
desc_stats['kurtosis'] = kurtosis
return desc_stats
I did not add a test for this function in chart/descriptive_stats.py. I figured I’d just test in stats_rand.test.py as I went along.
stats_rand.test.py
In addition to putting the list of descriptive statistics for each country in its blog post, I also want to build a larger table combining the stats for all randomly selected countries to add at the bottom of this post. So, I am going to need to save each country’s data as I go through my loop. And, make sure it is available once the loop terminates.
I will then need some functions to process the lists of statistics into an appropriate format for the related blog posts. So, before starting the loop, I will create a new variable: stat_by_cr = {}
. As I generate the list of stats for each country I will save it to this dictionary keyed on the country name, chk_nm
, as obtained from our search function.
Once I have plotted and saved the two histogram charts, I will get the list of stats for the current country, and save it to my new dictionary:
all_stats = dstat.get_all_dstats(labels, cy_p_data[use_yr])
stat_by_cr[chk_nm] = all_stats
Before continuing I did print the value of all_stats to the command line to make sure the call to get_all_dstats() was working more or less as expected. Again I have not done any significant testing to make absolutely sure. I may yet do so.
Now I need to format the list for the current country and add it to the appropriate file. So a couple new functions, none of which you probably need. Both fairly short. I am making full use of f-strings and the padding options. See the Resources section at the bottom of the post. The path / 'index.md'
in the open statement in add_stats_tbl_2_post is specific to pathlib module, which I am using insted of the os module. Have a look at “Why you should be using pathlib”.
def stats_2_md(c_nm, p_yr, dstats):
cr_yr = f"{c_nm}, {p_yr}"
md = f" {'Statistic':37} | {cr_yr} \n"
md += f" {'-' * 37} | {'-' * len(cr_yr)} \n"
for stat_nm in dstats:
md += f" {stat_nm:37} | {dstats[stat_nm]:>10.2f} \n"
return md
def add_stats_tbl_2_post(c_nm, p_yr, dstats, path):
"""Assumes post file is a markdown file named index.md in the directory specfified by path
dstats is dictionary of stats returned by call to descriptive_stats.get_all_dstats()
Uses stats_2_md() to convert stats dictionary into a markdown table
"""
with open(path / 'index.md', 'a') as blog_fl:
blog_fl.write("\n\n# Descriptive Statistics\n\n")
tbl_md = stats_2_md(c_nm, p_yr, dstats)
blog_fl.write(f"{tbl_md}\n")
And, here’s a look at the table for a single country. I would have preferred if the values in the data column where aligned on the decimal point, but… Who knows, may eventually try to fix that.
Statistic | Cuba, 2011 |
---|---|
sample mean | 38.44 |
est. mean low | 35.94 |
est. mean high | 40.94 |
sample variance | 473.69 |
sample standard deviation | 21.76 |
sheppard corrected variance | 471.61 |
sheppard corrected standard deviation | 21.72 |
1st quartile | 20.69 |
2nd quartile (median) | 38.88 |
3rd quartile | 53.73 |
IQR | 33.04 |
low outlier boundary | -28.86 |
upper outlier boundary | 103.28 |
moment coefficient of skewness | 0.20 |
Pearson’s 2nd coefficient | -0.06 |
Bowley’s measure of skewness | -0.10 |
kurtosis | 2.27 |
Though they are nicely aligned in my output. Here’s a sample of the actual output from the function.
Statistic | Cuba, 2011
------------------------------------- | ----------
...
IQR | 33.04
low outlier boundary | -28.86
upper outlier boundary | 103.28
moment coefficient of skewness | 0.20
Pearson's 2nd coefficient | -0.06
...
Multi-country Table
Ok, this one you can ignore. But, I am including it in case anyone is interested. The multi-country table generating function uses f-strings pretty much the same way as stats_2_md() did. But I need to loop through the countries in the dictionary to produce each line. There may be a better way, but, if there is, I haven’t figured it out just yet. (Well, this might be a case where the zip() function could be used. But, I have not yet looked at that. So, it may not prove any easier or clearer. See Resources section.)
def mult_stats_2_md(p_yr, dstats):
# get list of sorted country names
cr_alpha = sorted(dstats.keys())
s_lbl = f"Statistic ({p_yr})"
md = f" {s_lbl:37}"
ln_sep = f" {'-' * 37}"
for c_nm in cr_alpha:
l_pad = max(len(c_nm), 6)
md += f" | {c_nm:{l_pad}}"
ln_sep += f" | {'-' * l_pad}"
md += f"\n{ln_sep}\n"
for stat_nm in dstats[cr_alpha[0]].keys():
md += f" {stat_nm:37}"
for c_nm in cr_alpha:
l_pad = max(len(c_nm), 6)
md += f" | {dstats[c_nm][stat_nm]:{l_pad}.2f}"
md += "\n"
return md
5 Random Countries Processed
And, here we go. Think the code is pretty much done. So, going to generate the charts and statistics for 5 randomly selected countries. The selected countries will vary each time the module is executed, and could differ from computer to computer. Would have like to have some of the bigger countries in the world, but, I guess, that’s what random does.
(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/stats_rand.test.py
1: Checking for 'Turkmenistan (Turkmenistan)':
found 'Turkmenistan' in population database
2: Checking for 'Colombia, Republic of (Colombia)':
found 'Colombia' in population database
3: Checking for 'United Nations Neutral Zone (United Nations Neutral Zone)':
did not find 'United Nations Neutral Zone' in population database
3: Checking for 'Azerbaijan, Republic of (Azerbaijan)':
found 'Azerbaijan' in population database
4: Checking for 'Kenya, Republic of (Kenya)':
found 'Kenya' in population database
5: Checking for 'Cuba, Republic of (Cuba)':
found 'Cuba' in population database
And, the table combining all 5 countries.
Statistic (2011) | Azerbaijan | Colombia | Cuba | Kenya | Turkmenistan |
---|---|---|---|---|---|
sample mean | 31.61 | 30.96 | 38.44 | 21.87 | 27.91 |
est. mean low | 29.11 | 28.46 | 35.94 | 19.37 | 25.41 |
est. mean high | 34.11 | 33.46 | 40.94 | 24.37 | 30.41 |
sample variance | 383.92 | 418.02 | 473.69 | 287.53 | 361.37 |
sample standard deviation | 19.59 | 20.45 | 21.76 | 16.96 | 19.01 |
sheppard corrected variance | 381.83 | 415.94 | 471.61 | 285.45 | 359.29 |
sheppard corrected standard deviation | 19.54 | 20.39 | 21.72 | 16.90 | 18.95 |
1st quartile | 16.27 | 14.13 | 20.69 | 8.08 | 12.61 |
2nd quartile (median) | 29.00 | 27.93 | 38.88 | 18.20 | 24.72 |
3rd quartile | 46.02 | 45.39 | 53.73 | 31.98 | 40.74 |
IQR | 29.76 | 31.26 | 33.04 | 23.90 | 28.14 |
low outlier boundary | -28.37 | -32.77 | -28.86 | -27.77 | -29.60 |
upper outlier boundary | 90.66 | 92.28 | 103.28 | 67.83 | 82.95 |
moment coefficient of skewness | 0.46 | 0.56 | 0.20 | 0.95 | 0.65 |
Pearson’s 2nd coefficient | 0.40 | 0.44 | -0.06 | 0.65 | 0.50 |
Bowley’s measure of skewness | 0.14 | 0.12 | -0.10 | 0.15 | 0.14 |
kurtosis | 2.54 | 2.55 | 2.27 | 3.44 | 2.83 |
And, here are the pages/posts I generated for each of those countries.
That’s It for this One
As I said, short one. Not too sure what will happen next time. But, I do think I will go back to selecting countries at random and trying to estimate the median world age. Expect there will be a great deal of trial and error involved. Until then (next Monday).
Before then, I will commit all my changes and look at merging my py-stats branch into the main branch.
Resources
- Python f-strings - The Ultimate Usage Guide
- Python f-string tutorial
- How to pad/fill a string by a variable in Python using f-strings
- Why you should be using pathlib
- Using the Python zip() Function for Parallel Iteration