Bet you never thought I’d get to coding functions for the Measures of Shape. Well, primarily skewness. Though we will likely finish with a look at the probability density function, with a little help from SciPy.
Measures of Shape
I am going to start with skewness. I may also add a function to generate a number for the kurtosis of our sample. Then, if everything goes well, we’ll install SciPy and look at calculating an estimated PDF for our data sample.
Skewness
Basically, skewness attempts to measure how far from symmetrical our data is. That is, how different the left side of our data distribution looks from the right.
Adjusted Fisher-Pearson Coefficient of Skewness
$$G_{1} = \frac{\sqrt{N (N - 1)}}{N - 2} \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^3 / N}{s^3}$$

where: \(G_{1}\) is the Adjusted Fisher-Pearson Coefficient of Skewness,
\(N\) is the number of data points, which in our case is \(\sum_{i=1}^n w_{i}\),
\(s\) is our estimated standard deviation,
\(\bar Y\) is the estimated sample mean,
\(Y_{i}\) is the value of the ith variable (bin mid value to keep things consistent) and
\(w_{i}\) is the binned weight/value for the bin with value \(Y_{i}\)
Note that in computing the skewness, \(s\) is computed with \(N\) in the denominator rather than \(N - 1\). So we could re-write the above as:
$$G_{1} = \frac{\sqrt{N (N -1)}}{N - 2} \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^3 / N}{\lbrack \sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^2 / N \rbrack^\frac{3}{2}}$$
If we had all the actual values for the data distribution, all the \(w_{i}\) would be removed from the equations above.
From my reading, in the fraction on the right, the numerator is the estimate of the third central moment (also referred to as the third cumulant). The denominator is the variance, or second moment, taken to the power of 3/2, both computed with \(N\) in the denominator. I am going to write a function to calculate the third moment, then pass that value, the \(N\)-denominator variance and the total weight of our bins (\(N\)) to a function to calculate the coefficient of skewness. You may have noticed in my definition of the function get_binned_var() I had a parameter is_pop=False. If is_pop is False, the function returns the sample variance (dividing by \(N - 1\)); if True, it returns the population variance (dividing by \(N\)), which is what we want in our case. So, some functions to write and a new test section to set up (do_tst == 6).
def get_binned_3rd_moment(b_mids, b_wts, s_mean):
    # weighted third central moment: sum(w * (Y - mean)**3) / sum(w)
    tot_wt = sum(b_wts)
    sum_3rd = 0
    for i, wt in enumerate(b_wts):
        sum_3rd += wt * (b_mids[i] - s_mean)**3
    moment = sum_3rd / tot_wt
    return moment
def calc_coeff_skew(m_3rd, ub_var, s_n):
    # adjusted Fisher-Pearson coefficient of skewness
    # Note: in computing the skewness, s is computed with N in the denominator rather than N - 1.
    # https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
    # Negative values indicate data that are skewed left; positive values indicate data skewed right.
    # In https://brownmath.com/stat/shape.htm, Bulmer, M. G. 1979. 'Principles of Statistics'. Dover
    # (https://brownmath.com/swt/sources.htm#so_Bulmer1979) suggests:
    #   less than -1 or greater than +1: highly skewed
    #   between -1 and -1/2 or between +1/2 and +1: moderately skewed
    #   between -1/2 and +1/2: approximately symmetric
    bias = ((s_n * (s_n - 1))**0.5) / (s_n - 2)
    ub_coeff = m_3rd / (ub_var**(3/2))
    return bias * ub_coeff
# moved calc for grp_mids out of conditional for tests
# added test #6; stop plotting the histogram if test #6 is being run
if do_tst == 6:
    s_ub_var = get_binned_var(grp_mids, p_data, s_mean, is_pop=True)
    moment_3 = get_binned_3rd_moment(grp_mids, p_data, s_mean)
    tot_n = sum(p_data)
    coeff_skew = calc_coeff_skew(moment_3, s_ub_var, tot_n)
    print(f"unbiased var: {s_ub_var}, 3rd moment: {moment_3}, coeff skew: {coeff_skew}")
if do_tst >= 2 and do_tst != 6:
    ax.legend()
if do_tst != 6:
    plt.show()
And I get the following, which jibes with the number I got while working on the first post covering Descriptive Statistics.
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011
unbiased var: 411.33991695119494, 3rd moment: 1953.0873458780957, coeff skew: 0.23411048000251844
As mentioned in that first post, there are other measures of skewness. Let’s go through them quickly.
Pearson’s second skewness coefficient (median skewness)
$$S_{k_{2}} = \frac{3 (\bar{Y} - \tilde{Y})}{s}$$

where: \(S_{k_{2}}\) is Pearson’s second skewness coefficient,
\(\bar{Y}\) is the estimated sample mean,
\(\tilde{Y}\) is the estimated sample median and
\(s\) is the estimated sample standard deviation
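In code that’s a one-liner. Something like the following went into test #6 (a sketch; s_median and s_sdev are the estimated sample median and standard deviation from the earlier test sections):

pearson_2nd = (3 * (s_mean - s_median)) / s_sdev
print(f"Pearson's 2nd coefficient: {pearson_2nd}")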
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011
unbiased var: 411.33991695119494, 3rd moment: 1953.0873458780957, coeff skew: 0.23411048000251844
Pearson's 2nd coefficient: 0.026927189879036757
Which doesn’t quite match the value in the first post. In that post I used Sheppard’s corrected variance; this time I am using the uncorrected sample variance.
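For reference, Sheppard’s correction for grouped data subtracts \(c^2/12\) from the grouped variance, where \(c\) is the bin width. A minimal sketch of how sc_var and sc_sdev could be computed, assuming the 5-year bin width of our age groups:

b_width = 5
sc_var = s_var - (b_width**2) / 12   # Sheppard's corrected variance
sc_sdev = sc_var**0.5                # corrected standard deviation

Plugging in the numbers from the output further below: 411.340216 − 25/12 ≈ 409.256883, which lines up (modulo rounding) with the Sheppard’s corrected variance printed there.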
Bowley Skewness Formula

$$S_{k_{B}} = \frac{Q_{1} + Q_{3} - 2 Q_{2}}{Q_{3} - Q_{1}}$$

where \(Q_{1}\), \(Q_{2}\) (the median) and \(Q_{3}\) are the estimated quartiles.
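And in code, it’s another one-liner (a sketch; s_q1, s_q3, s_median and iqr come from the quartile calculations):

bowley = (s_q1 + s_q3 - 2 * s_median) / iqr
print(f"Bowley measure of skewness: {bowley:.6f}")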
The value I get, -0.033876, matches the value in the first post.
Kurtosis
You may recall we used the third moment of the distribution in calculating skewness. Kurtosis uses the fourth moment, which means that outliers in a sample have even more effect on kurtosis than they do on skewness.
$$\mathrm{kurtosis} = \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^4 / N}{s^4}$$

where: \(N\) is the number of data points, which in our case is \(\sum_{i=1}^n w_{i}\),
\(\bar Y\) is the estimated sample mean,
\(Y_{i}\) is the value of the ith variable (bin mid value to keep things consistent),
\(w_{i}\) is the binned weight/value for the bin with value \(Y_{i}\) and
\(s\) is our estimated standard deviation (calculated using \(N\), not \(N - 1\))
Okay, a new function or two, and a quick test.
def get_binned_4th_moment(b_mids, b_wts, s_mean):
    # weighted fourth central moment: sum(w * (Y - mean)**4) / sum(w)
    tot_wt = sum(b_wts)
    sum_4th = 0
    for i, wt in enumerate(b_wts):
        sum_4th += wt * (b_mids[i] - s_mean)**4
    moment = sum_4th / tot_wt
    return moment
def calc_kurtosis(m_4th, ub_var):
    # kurtosis for a standard normal distribution is 3
    # fourth moment divided by the variance squared (i.e. s**4)
    kurtosis = m_4th / (ub_var**2)
    return kurtosis
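And the quick test is just a couple of calls, something like this (a sketch; s_ub_var is the \(N\)-denominator variance already computed for the skewness):

moment_4 = get_binned_4th_moment(grp_mids, p_data, s_mean)
kurt = calc_kurtosis(moment_4, s_ub_var)
print(f"kurtosis: {kurt:.6f}")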
The value I get is 2.307935 (after a bit of rounding). As near as I can make out, that implies a “light-tailed” distribution, i.e. not a lot of outliers. Again, perhaps indicating more symmetry than I think I see in the histogram, though I am not sure that is a valid interpretation. Note: a univariate normal distribution has a kurtosis of 3.
All the Measures
I modified test #6 to print out all the measures — likely to a few too many decimal places.
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011
Mean, Variance and Standard Deviation:
estimated sample mean: 35.558426
mean lower bound: 33.058433, mean upper bound: 38.058426
sample weighted variance: 411.340216, sheppard corrected variance: 409.256882
sample weighted standard deviation: 20.281524, sheppard corrected std dev: 20.230098
Median, Quartiles and IQR:
1st quartile: 19.728773, 2nd quartile (median): 35.376384, 3rd quartile: 49.998570
IQR: 30.269797 and upper outlier value: 95.403265
Measures of Skewness:
moment coefficient of skewness: 0.234110
Pearson 2nd coefficient: 0.026927
Bowley measure of skewness: -0.033876
And the code for test #6 now looks like the following. I also had to move a few calculations out of the other test sections so that I could use them in test #6.
if do_tst == 6:
    print("\nMean, Variance and Standard Deviation:")
    print(f"\testimated sample mean: {s_mean:.6f}")
    print(f"\t\tmean lower bound: {mean_low:.6f}, mean upper bound: {mean_high:.6f}")
    print(f"\tsample weighted variance: {s_var:.6f}, sheppard's corrected variance: {sc_var:.6f}")
    print(f"\tsample weighted standard deviation: {s_sdev:.6f}, sheppard's corrected std dev: {sc_sdev:.6f}")
    print("\nMedian, Quartiles and IQR:")
    print(f"\t1st quartile: {s_q1:.6f}, 2nd quartile (median): {s_q2:.6f}, 3rd quartile: {s_q3:.6f}")
    print(f"\tIQR: {iqr:.6f} and upper outlier value: {up_out:.6f}")
    s_ub_var = get_binned_var(grp_mids, p_data, s_mean, is_pop=True)
    moment_3 = get_binned_3rd_moment(grp_mids, p_data, s_mean)
    tot_n = sum(p_data)
    coeff_skew = calc_coeff_skew(moment_3, s_ub_var, tot_n)
    print("\nMeasures of Skewness:")
    # print(f"\tunbiased var: {s_ub_var}, 3rd moment: {moment_3}")
    print(f"\tmoment coefficient of skewness: {coeff_skew:.6f}")
    print(f"\tPearson's 2nd coefficient: {((3*(s_mean - s_median))/s_sdev):.6f}")
    print(f"\tBowley's measure of skewness: {((s_q1 + s_q3 - (2*s_median)) / iqr):.6f}")
A Little Playing Around
Let’s look at getting and plotting the estimated “probability density function” (PDF) for our data. To do that we will need to install SciPy. Well, at least I needed to; there are likely other ways.
(base-3.8) PS R:\learn\py_play> conda install scipy
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.7.12
latest version: 4.8.5
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: E:\appDev\Miniconda3\envs\base-3.8
added / updated specs:
- scipy
The following packages will be downloaded:
package | build
---------------------------|-----------------
openssl-1.1.1h | he774522_0 4.8 MB
scipy-1.5.0 | py38h9439919_0 12.0 MB
------------------------------------------------------------
Total: 16.8 MB
The following NEW packages will be INSTALLED:
scipy pkgs/main/win-64::scipy-1.5.0-py38h9439919_0
The following packages will be UPDATED:
openssl 1.1.1g-he774522_1 -> 1.1.1h-he774522_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
openssl-1.1.1h | 4.8 MB | ############################################################################ | 100%
scipy-1.5.0 | 12.0 MB | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(base-3.8) PS R:\learn\py_play>
Then we need a bit of code to calculate the PDF and plot it on the histogram. Oh yes, I added a test #7 to do that work. Don’t forget to import the necessary stuff from SciPy; I used from scipy import stats.
if do_tst == 7:
    # area of the histogram: bin width times total weight,
    # used to scale the PDF (which has area 1) up to the histogram's counts
    b_width = 5
    h_area = 0
    for pop in p_data:
        h_area += b_width * pop
    best_fit_line = stats.norm.pdf(bins, s_mean, s_sdev)
    print(f"best_fit_line = stats.norm.pdf({bins}, {s_mean}, {s_sdev}) = {best_fit_line}")
    best_fit_line *= h_area
    print(best_fit_line)
    plt.plot(bins, best_fit_line, 'r-', alpha=0.6, label='Est. normal PDF')
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 7
0 China 2011
best_fit_line = stats.norm.pdf([ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
90 95 100 105], 35.55842554253051, 20.28152400048459) = [4.22989496e-03 6.32187067e-03 8.89132615e-03 1.17677216e-02
1.46562593e-02 1.71774543e-02 1.89452079e-02 1.96627773e-02
1.92041572e-02 1.76502376e-02 1.52654907e-02 1.24244134e-02
9.51581360e-03 6.85836934e-03 4.65158244e-03 2.96883106e-03
1.78309777e-03 1.00778929e-03 5.36005485e-04 2.68270943e-04
1.26352241e-04 5.60011791e-05]
# best fit * h_area
[ 29112.20203425 43510.20009377 61194.44698875 80991.2045835
100871.53090019 118223.62611197 130390.16933982 135328.8318062
132172.38487972 121477.55150998 105064.55926672 85510.87812094
65492.47446373 47202.64581587 32014.46108379 20432.94464441
12272.14933204 6936.09789486 3689.05140355 1846.37159136
869.61780259 385.42745239]
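While we’re at it, SciPy could also be used to sanity-check the hand-rolled skewness and kurtosis functions. Expanding our real population counts into individual values would be impractical, but on a small made-up sample it works fine. A sketch, with hypothetical bin mids and weights:

import numpy as np
from scipy import stats

# hypothetical toy data: bin mid-points and small integer weights
mids = np.array([2.5, 7.5, 12.5, 17.5, 22.5])
wts = np.array([2, 5, 8, 4, 1])

# expand the binned data into individual observations
expanded = np.repeat(mids, wts)

# adjusted Fisher-Pearson G1, comparable to calc_coeff_skew()
print(stats.skew(expanded, bias=False))

# Pearson (non-excess) kurtosis, m4 / s**4, comparable to calc_kurtosis()
print(stats.kurtosis(expanded, fisher=False))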
Until Another Day
Think I will call it quits right here. My plan now is to put the tests, as appropriate, into functions, along with whatever "global" data I need to make things work. Then, in the next post, I am going to generate plots, descriptive stats, etc. for Germany, the United States of America and Kenya to compare with those of China for the year 2011. Not sure how I will display it all, but will do what I can. Could be a waste of time, but I am curious as to what we might or might not find.
Until then…
Resources
- Skewness
- Measures of Skewness and Kurtosis
- Measures of Shape: Skewness and Kurtosis
- Bowley Skewness: Definition and Worked Example
- Kurtosis
- SciPy
- scipy.stats.norm
- How to fit a distribution to a histogram in Python
- Fitting distribution in histogram using Python
- Fit a curve to a histogram in Python