Bet you never thought I’d get to coding functions for the Measures of Shape. Well, primarily skewness. Though we will likely finish with a look at the probability density function, with a little help from SciPy.

Measures of Shape

I am going to start with skewness. I may add a function to generate a number for the kurtosis of our sample as well. Then, if everything goes well, we’ll install SciPy and look at calculating an estimated PDF for our data sample.

Skewness

Basically, skewness attempts to measure how far from symmetrical our data is. That is, how different the left side of our data distribution looks from the right side.

Adjusted Fisher-Pearson Coefficient of Skewness

$$G_{1} = \frac{\sqrt{N (N -1)}}{N - 2} \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^3 / N}{s^3}$$

where: (G_{1}) is the Adjusted Fisher-Pearson Coefficient of Skewness,
(N) is the number of data points, which in our case is (\sum_{i=1}^N w_{i}),
(s) is our estimated standard deviation,
(\bar Y) is the estimated sample mean,
(Y_{i}) is the value of the ith variable (bin mid value to keep things consistent) and
(w_{i}) is the binned weight/value for the bin with value (Y_{i})

Note that in computing the skewness, \(s\) is computed with \(N\) in the denominator rather than \(N - 1\). So we could re-write the above as:

$$G_{1} = \frac{\sqrt{N (N -1)}}{N - 2} \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^3 / N}{\lbrack \sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^2 / N \rbrack^\frac{3}{2}}$$

If we had all the actual values for the data distribution, all the \(w_{i}\) would be removed from the equations above.

From my reading, in the fraction on the right, the numerator is the estimate of the third central moment (related to the third cumulant). The denominator is the variance (the second central moment, computed with (N)) raised to the power of 3/2. I am going to write a function to calculate the third moment, then pass that value, the variance and the total weight of our bins ((N)) to a function to calculate the coefficient of skewness. You may have noticed in my definition of the function get_binned_var() I had a parameter is_pop=False. If is_pop is false the function divides by (N - 1), the usual sample estimate; if true it divides by (N), which is the form the skewness formula above calls for. So, some functions to write and a new test section to set up (do_tst == 6).

def get_binned_3rd_moment(b_mids, b_wts, s_mean):
  # weighted third central moment: sum(w_i * (Y_i - mean)**3) / sum(w_i)
  tot_wt = sum(b_wts)
  sum_3rd = 0
  for i, wt in enumerate(b_wts):
    sum_3rd += wt * (b_mids[i] - s_mean)**3
  moment = sum_3rd / tot_wt
  return moment


def calc_coeff_skew(m_3rd, ub_var, s_n):
  # adjusted Fisher-Pearson coefficient of skewness
  # Note that in computing the skewness, the s is computed with N in the denominator rather than N - 1. 
  # https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
  # Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.
  # in https://brownmath.com/stat/shape.htm:
  # Bulmer (1979) https://brownmath.com/swt/sources.htm#so_Bulmer1979 = Bulmer, M. G. 1979. 'Principles of Statistics'. Dover. suggests
  # If skewness is less than −1 or greater than +1, the distribution is highly skewed.
  # If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
  # If skewness is between −½ and +½, the distribution is approximately symmetric.
  bias = ((s_n * (s_n - 1))**0.5) / (s_n - 2)
  ub_coeff = m_3rd / (ub_var**(3/2))
  return bias * ub_coeff

# moved calc for grp_mids out of conditional for tests
# added test #6; stopped plotting of the histogram if test #6 is being run
  if do_tst == 6:
    s_ub_var = get_binned_var(grp_mids, p_data, s_mean, is_pop=True)
    moment_3 = get_binned_3rd_moment(grp_mids, p_data, s_mean)
    tot_n = sum(p_data)
    coeff_skew = calc_coeff_skew(moment_3, s_ub_var, tot_n)
    print(f"unbiased var: {s_ub_var}, 3rd moment: {moment_3}, coeff skew: {coeff_skew}")

  if do_tst >= 2 and do_tst != 6:
    ax.legend()
  if do_tst != 6:
    plt.show()

And, I get the following, which jibes with the number I got while working on the first post covering Descriptive Statistics.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011
unbiased var: 411.33991695119494, 3rd moment: 1953.0873458780957, coeff skew: 0.23411048000251844
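As a quick sanity check, we can plug the printed moment and variance back into the right-hand fraction. With (N) up in the billions, the adjustment factor (\sqrt{N (N - 1)} / (N - 2)) is effectively 1, so the coefficient reduces to the third moment over the variance to the power 3/2:

```python
# sanity check: recompute the coefficient of skewness from the values
# printed by test #6; with N in the billions the bias-adjustment
# factor sqrt(N * (N - 1)) / (N - 2) is effectively 1
m_3rd = 1953.0873458780957
ub_var = 411.33991695119494
coeff_skew = m_3rd / ub_var**1.5
print(coeff_skew)  # ~0.234110, matching the test output above
```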

As mentioned in that first post, there are other measures of skewness. Let’s go through them quickly.

Pearson’s second skewness coefficient (median skewness)

$$S_{k_{2}} = 3 \frac{\bar Y - \tilde{Y}}{s}$$

where: (S_{k_{2}}) is Pearson’s second skewness coefficient,
(\bar{Y}) is the estimated sample mean,
(\tilde{Y}) is the estimated sample median and
(s) is the estimated sample standard deviation

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011
unbiased var: 411.33991695119494, 3rd moment: 1953.0873458780957, coeff skew: 0.23411048000251844
Pearson's 2nd coefficient: 0.026927189879036757

Which doesn’t quite match the value in the first post. In that post I used Sheppard’s corrected variance; this time I am using the uncorrected sample variance.
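The formula translates directly into code; a minimal sketch, using the mean, median and standard deviation from the test #6 summary output (the function name is my own):

```python
def calc_pearson_2nd(s_mean, s_median, s_sdev):
  # Pearson's second skewness coefficient: 3 * (mean - median) / std dev
  return 3 * (s_mean - s_median) / s_sdev

# mean, median and std dev from the test #6 summary output
print(calc_pearson_2nd(35.558426, 35.376384, 20.281524))  # ~0.026927
```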

Bowley Skewness Formula

$$B_{1} = \frac{Q_{3} + Q_{1} - 2Q_{2}}{Q_{3} - Q_{1}}$$

where: (Q_{1}), (Q_{2}) and (Q_{3}) are the estimated first, second (i.e. the median) and third quartiles.

The value I get, -0.033876, matches the value in the first post.
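And a matching sketch for Bowley’s measure, using the quartiles from the test #6 summary output (again, the function name is my own):

```python
def calc_bowley_skew(q1, q2, q3):
  # Bowley's quartile-based measure of skewness
  return (q3 + q1 - 2 * q2) / (q3 - q1)

# quartiles from the test #6 summary output
print(calc_bowley_skew(19.728773, 35.376384, 49.998570))  # ~-0.033876
```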

Kurtosis

You may recall we used the third moment of the distribution in calculating skewness. Kurtosis uses the fourth moment, which means that outliers in a sample have even more effect on kurtosis than they do on skewness.

$$kurtosis = \frac{\sum_{i=1}^N w_{i} (Y_{i} - \bar Y)^4 / N}{s^4}$$

where: (N) is the number of data points, which in our case is (\sum_{i=1}^N w_{i}),
(\bar Y) is the estimated sample mean,
(Y_{i}) is the value of the ith variable (bin mid value to keep things consistent),
(w_{i}) is the binned weight/value for the bin with value (Y_{i}) and
(s) is our estimated standard deviation (calculated using (N), not (N-1), in the denominator)

Okay a new function or two, and a quick test.

def get_binned_4th_moment(b_mids, b_wts, s_mean):
  # weighted fourth central moment: sum(w_i * (Y_i - mean)**4) / sum(w_i)
  tot_wt = sum(b_wts)
  sum_4th = 0
  for i, wt in enumerate(b_wts):
    sum_4th += wt * (b_mids[i] - s_mean)**4
  moment = sum_4th / tot_wt
  return moment


def calc_kurtosis(m_4th, ub_var):
  # kurtosis for a standard normal distribution is 3
  # s**4 is the variance squared, so divide by ub_var**2
  kurtosis = m_4th / (ub_var**2)
  return kurtosis

The value I get is 2.307935 (after a bit of rounding). Near as I can make out, that implies a “light tailed” distribution, i.e. not a lot of outliers. Since kurtosis describes the tails rather than symmetry, I don’t think it says much about the asymmetry I think I see in the histogram. Note, a univariate normal distribution has a kurtosis of 3.
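As a sanity check on the kurtosis calculation, here is a self-contained sketch (the moment helpers are restated so it runs on its own; get_binned_2nd_moment is a name of my own). A discrete uniform distribution should come out platykurtic, around 1.78, well below the normal distribution’s 3:

```python
def get_binned_4th_moment(b_mids, b_wts, s_mean):
  # weighted fourth central moment
  tot_wt = sum(b_wts)
  return sum(wt * (b_mids[i] - s_mean)**4 for i, wt in enumerate(b_wts)) / tot_wt


def get_binned_2nd_moment(b_mids, b_wts, s_mean):
  # weighted variance, with N (total weight) in the denominator
  tot_wt = sum(b_wts)
  return sum(wt * (b_mids[i] - s_mean)**2 for i, wt in enumerate(b_wts)) / tot_wt


# ten equal-weight bins: a discrete uniform distribution
mids = [0.5 + i for i in range(10)]
wts = [1] * 10
mean = 5.0
var = get_binned_2nd_moment(mids, wts, mean)           # 8.25
kurt = get_binned_4th_moment(mids, wts, mean) / var**2
print(kurt)  # ~1.78, light-tailed compared to the normal's 3
```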

All the Measures

I modified test #6 to print out all the measures — likely to a few too many decimal places.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 6
0 China 2011

Mean, Variance and Standard Deviation:
        estimated sample mean: 35.558426
                mean lower bound: 33.058433, mean upper bound: 38.058426
        sample weighted variance: 411.340216, sheppard's corrected variance: 409.256882
        sample weighted standard deviation: 20.281524, sheppard's corrected std dev: 20.230098

Median, Quartiles and IQR:
        1st quartile: 19.728773, 2nd quartile (median): 35.376384, 3rd quartile: 49.998570
        IQR: 30.269797 and upper outlier value: 95.403265

Measures of Skewness:
        moment coefficient of skewness: 0.234110
        Pearson's 2nd coefficient: 0.026927
        Bowley's measure of skewness: -0.033876

And the code for test #6 now looks like the following. I also had to move a few calculations out of the other test sections so that I could use them in test #6.

  if do_tst == 6:
    print("\nMean, Variance and Standard Deviation:")
    print(f"\testimated sample mean: {s_mean:.6f}")
    print(f"\t\tmean lower bound: {mean_low:.6f}, mean upper bound: {mean_high:.6f}")
    print(f"\tsample weighted variance: {s_var:.6f}, sheppard's corrected variance: {sc_var:.6f}")
    print(f"\tsample weighted standard deviation: {s_sdev:.6f}, sheppard's corrected std dev: {sc_sdev:.6f}")
    print("\nMedian, Quartiles and IQR:")
    print(f"\t1st quartile: {s_q1:.6f}, 2nd quartile (median): {s_q2:.6f}, 3rd quartile: {s_q3:.6f}")
    print(f"\tIQR: {iqr:.6f} and upper outlier value: {up_out:.6f}")

    s_ub_var = get_binned_var(grp_mids, p_data, s_mean, is_pop=True)
    moment_3 = get_binned_3rd_moment(grp_mids, p_data, s_mean)
    tot_n = sum(p_data)
    coeff_skew = calc_coeff_skew(moment_3, s_ub_var, tot_n)
    print("\nMeasures of Skewness:")
    #print(f"\tunbiased var: {s_ub_var}, 3rd moment: {moment_3}")
    print(f"\tmoment coefficient of skewness: {coeff_skew:.6f}")
    print(f"\tPearson's 2nd coefficient: {((3*(s_mean - s_median))/s_sdev):.6f}")
    print(f"\tBowley's measure of skewness: {((s_q1 + s_q3 - (2*s_median)) / iqr):.6f}")

A Little Playing Around

Let’s look at getting and plotting the estimated “probability density function” (PDF) for our data. To do that we will need to install SciPy. Well, at least I did; there are likely other ways.

(base-3.8) PS R:\learn\py_play> conda install scipy
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.5

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##

  environment location: E:\appDev\Miniconda3\envs\base-3.8

  added / updated specs:
    - scipy

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1h             |       he774522_0         4.8 MB
    scipy-1.5.0                |   py38h9439919_0        12.0 MB
    ------------------------------------------------------------
                                           Total:        16.8 MB

The following NEW packages will be INSTALLED:

  scipy              pkgs/main/win-64::scipy-1.5.0-py38h9439919_0

The following packages will be UPDATED:

  openssl                                 1.1.1g-he774522_1 -> 1.1.1h-he774522_0

Proceed ([y]/n)? y

Downloading and Extracting Packages
openssl-1.1.1h       | 4.8 MB    | ############################################################################ | 100%
scipy-1.5.0          | 12.0 MB   | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(base-3.8) PS R:\learn\py_play>

Then we need a bit of code to calculate the PDF, and plot it on the histogram. Oh yes, I added a test #7 to do that work. Don’t forget to import the necessary stuff from SciPy. I used from scipy import stats.

  if do_tst == 7:
    b_width = 5
    h_area = 0
    for pop in p_data:
      h_area += b_width * pop
    best_fit_line = stats.norm.pdf(bins, s_mean, s_sdev)
    print(f"best_fit_line = stats.norm.pdf({bins}, {s_mean}, {s_sdev}) = {best_fit_line}")
    best_fit_line *= h_area
    print(best_fit_line)
    plt.plot(bins, best_fit_line, 'r-', alpha=0.6, label='Est. normal PDF')
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 7
0 China 2011
best_fit_line = stats.norm.pdf([  0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85
  90  95 100 105], 35.55842554253051, 20.28152400048459) = [4.22989496e-03 6.32187067e-03 8.89132615e-03 1.17677216e-02
 1.46562593e-02 1.71774543e-02 1.89452079e-02 1.96627773e-02
 1.92041572e-02 1.76502376e-02 1.52654907e-02 1.24244134e-02
 9.51581360e-03 6.85836934e-03 4.65158244e-03 2.96883106e-03
 1.78309777e-03 1.00778929e-03 5.36005485e-04 2.68270943e-04
 1.26352241e-04 5.60011791e-05]
# best fit * h_area
[ 29112.20203425  43510.20009377  61194.44698875  80991.2045835
 100871.53090019 118223.62611197 130390.16933982 135328.8318062
 132172.38487972 121477.55150998 105064.55926672  85510.87812094
  65492.47446373  47202.64581587  32014.46108379  20432.94464441
  12272.14933204   6936.09789486   3689.05140355   1846.37159136
    869.61780259    385.42745239]
[Figure: histogram of China 2011 population with a plot of the estimated probability density function overlaid]
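Why multiply by h_area? stats.norm.pdf returns a density whose total area under the curve is 1, while the histogram’s bars have a total area of bin width times total count. Scaling by h_area puts the curve on the same vertical scale as the bars. A quick check that the raw density really does integrate to (roughly) 1, using the mean and standard deviation from the output above:

```python
import numpy as np
from scipy import stats

# the normal PDF integrates to 1; approximate the integral with a
# Riemann sum over a range wide enough to cover both tails
xs = np.linspace(-70, 150, 2201)  # step of 0.1
pdf = stats.norm.pdf(xs, 35.558426, 20.281524)
area = pdf.sum() * (xs[1] - xs[0])
print(area)  # ~1.0
```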

Until Another Day

Think I will call it quits right here. My plan now is to put the tests, as appropriate, into functions, along with whatever "global" data I need to make things work. Then, in the next post, I am going to generate plots, descriptive stats, etc. for Germany, the United States of America and Kenya to compare with those of China for the year 2011. Not sure how I will display it all, but will do what I can. Could be a waste of time, but I am curious as to what we might or might not find.

Until then…

Resources