Okay, on with the code to handle dispersion and skewness. We are going to see some more formulae of varying complexity. But, I am sure, nothing we can’t handle. And, because we are dealing with a sample distribution rather than the whole population, we will need to make adjustments in some of the formulae. Not that I have in any way tried to sort out the actual mathematics justifying those changes. Suppose I should, but…

Variance and Standard Deviation

We pretty much need to start with variance based on the definition of standard deviation.

Variance is one of those cases where the formula changes for a weighted sample. Feel free to check out the Wikipedia article in the Resources section at the bottom of the post. I don’t really get the difference between frequency weights and reliability weights, but I have decided we have frequency weights. Though I am sure more research is called for, and I may end up changing my mind at some point. But, for now…

$$s^{2} = \frac{\sum_{i=1}^n w_{i}(x_{i} - \mu^{*})^2}{V_{1} - 1}$$

where: \(s^2\) is our estimated sample variance,
\(x_{i}\) is the value of the ith variable,
\(w_{i}\) is the count/weight for \(x_{i}\),
\(\mu^{*}\) is the weighted mean, and
\(V_{1}\) is \(\sum_{i=1}^n w_{i}\)
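
To make sure the formula behaves as expected, here is a quick hand calculation with a made-up sample (not our population data): values \(x = (1, 2, 3)\) with weights \(w = (2, 3, 5)\), so \(V_{1} = 10\) and \(\mu^{*} = (2 \cdot 1 + 3 \cdot 2 + 5 \cdot 3) / 10 = 2.3\). Then:

$$s^{2} = \frac{2(1 - 2.3)^2 + 3(2 - 2.3)^2 + 5(3 - 2.3)^2}{10 - 1} = \frac{3.38 + 0.27 + 2.45}{9} \approx 0.678$$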

def get_binned_var(b_vals, b_wts, s_mean, is_pop=False):
  ''' Calculate and return the estimated weighted variance for binned data.

      b_vals: bin midpoints, b_wts: bin counts/weights, s_mean: weighted mean
      is_pop: if True, treat the weighted data as the whole population rather than a sample
  '''
  wt_tot = sum(b_wts)
  v_sum = 0
  for i, wt in enumerate(b_wts):
    v_sum += wt * ((b_vals[i] - s_mean)**2)
  if is_pop:
    sigma_sq = v_sum / wt_tot
  else:
    sigma_sq = v_sum / (wt_tot - 1)
  return sigma_sq
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 4
0 China 2011
[85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474, 95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981, 61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912, 1518.293, 268.595, 25.722]
sample weighted variance: 411.34021578223246
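
As a quick sanity check on the frequency-weight interpretation (this is not part of the post’s script, and the bin values and counts below are made up), the weighted calculation should agree with the standard library’s statistics.variance() run on the data expanded to one entry per count:

import statistics

# hypothetical bin midpoints and integer counts, purely for the check
b_vals = [2.5, 7.5, 12.5, 17.5]
b_wts = [3, 5, 4, 2]
w_mean = sum(v * w for v, w in zip(b_vals, b_wts)) / sum(b_wts)

# expand to one entry per count and compare the two calculations
expanded = [v for v, w in zip(b_vals, b_wts) for _ in range(w)]
print(get_binned_var(b_vals, b_wts, w_mean))   # weighted sample variance
print(statistics.variance(expanded))           # same value from the expanded data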

Variance is essentially the average of the squared distances between each point and the mean. As such, it is not in the same unit of measure as the sample data, e.g. if the original data is in metres, the variance is in square metres. The usual solution is to use the standard deviation, which is just the square root of the variance, and as such is once again in the same unit of measure as the data. And, for certain distributions, it tells us something very specific about the dispersion.

For a normal distribution, 68% of the data lie within one standard deviation of the mean, 95% within two, and 99.7% within three. (Note, some rounding involved, as well as some ignoring of the real statistics involved.)
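
Just to convince myself of those percentages, here is a quick check (not part of the post’s code) that draws a large normal sample with numpy and counts how much of it lands within each band:

import numpy as np

# draw a large sample from a standard normal distribution
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=1_000_000)
# fraction of the sample within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
  print(f'within {k} std dev: {np.mean(np.abs(sample) <= k):.3f}')
# prints roughly 0.683, 0.954 and 0.997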

So for fun, let’s calculate the standard deviation and plot lines for ± 1, 2 and 3 standard deviations.

In our case the simple standard deviation is 20.28, after a bit of rounding. I did not include the lines for minus 2 and minus 3 standard deviations.

And, because we will likely be interested in getting the standard deviation directly, I have added a function to do just that, get_binned_sd().

Note: I made some changes to my test section. I added a conditional block that calculates various measures in advance for certain tests, to avoid repeating the same calculations in multiple test sections. I am including the code for the whole section below.

def get_binned_sd(b_mids, b_wts, s_mean):
  ''' Calculate and return the estimated standard deviation (square root of the weighted variance) for binned data.'''
  sigma_sq = get_binned_var(b_mids, b_wts, s_mean)
  return sigma_sq**0.5


if __name__ == '__main__':
  import argparse
  cr_nms = ['China', 'Germany', 'United States of America', 'Kenya']

  max_tst = 4
  max_cr = 4
  use_cr = 0
  use_yr = '2011'
  # check for test number on command line
  parser = argparse.ArgumentParser()
  # long name preceded by --, short by single -, get it as an integer so can use to access test data array
  parser.add_argument('--do_test', '-t', type=int, help=f'Number of statistic to display (1-{max_tst})')
  parser.add_argument('--cr_use', '-c', type=int, help=f'Use country # (1-{max_cr})')
  parser.add_argument('--yr_use', '-y', help=f'Year to use for analysis (1950-2100)')

  args = parser.parse_args()
  if args.do_test and (args.do_test >= 1 and args.do_test <= max_tst):
    do_tst = args.do_test
  else:
    do_tst = 1
  if args.cr_use and (args.cr_use >= 1 and args.cr_use <= max_cr):
    use_cr = args.cr_use - 1
  if args.yr_use and (int(args.yr_use) >= 1950 and int(args.yr_use) <= 2100):
    use_yr = args.yr_use

  # want just the data, nothing else, and regardless of which test we might be running
  cr_nm = cr_nms[use_cr]
  p_yr = use_yr
  print(use_cr, cr_nm, p_yr)
  p_data = pdb.get_1cr_years_all_v1(cr_nm, [p_yr])[p_yr]
  #p_data = pdb.get_1cr_years_all_v1(cr_nm, [p_yr])
  print(p_data)
  #print(LABELS)

  # display the histogram, again regardless of which test is being run, well almost
  b_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in LABELS]
  fig, ax = plt.subplots(figsize=(10,6))
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, b_lbls, p_data)

  # want to get rid of Pylance warnings
  grp_mids = []
  s_mean = 0
  s_var = 0
  s_sdev = 0

  if do_tst == 2 or do_tst == 4:
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    s_var = get_binned_var(grp_mids, p_data, s_mean)
    s_sdev = get_binned_sd(grp_mids, p_data, s_mean)

  if do_tst == 2:
    #print(get_grp_middle(LABELS, b_width=5))
    #print(get_grp_middle(LABELS))
    # add line showing mean to histogram
    plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')

    grp_lows = get_grp_low_bdrys(LABELS)
    grp_highs = get_grp_up_bdrys(LABELS)
    mean_low = get_wt_mean(grp_lows, p_data)
    mean_high = get_wt_mean(grp_highs, p_data)
    print(f"estimated sample mean: {s_mean}, mean lower bound: {mean_low}, mean upper bound: {mean_high}")
    # add lines showing estimated mean limits
    plt.axvline(mean_low, 0, 1, color='k', label=f'Mean Lower Bound: {mean_low:.2f}')
    plt.axvline(mean_high, 0, 1, color='m', label=f'Mean Upper Bound: {mean_high:.2f}')

  if do_tst == 3:
    s_median = get_wt_median(LABELS, p_data)
    print(s_median)
    plt.axvline(s_median, 0, 1, color='r', label=f'Sample Median: {s_median:.2f}')

  if do_tst == 4:
    print(f'sample weighted variance: {s_var}')
    sqr_var = s_var**0.5
    print(f'sample weighted standard deviation: {s_sdev}, square root variance: {sqr_var}')
    plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')
    plt.axvline(s_mean + s_sdev, 0, 1, color='c', label=f'Plus/Minus 1 Std Dev, {s_sdev:.2f}')
    plt.axvline(s_mean - s_sdev, 0, 1, color='c')
    plt.axvline(s_mean + (2 * s_sdev), 0, 1, color='m', label=f'Plus/Minus 2 Std Dev, {2 * s_sdev:.2f}')
    plt.axvline(s_mean + (3 * s_sdev), 0, 1, color='k', label=f'Plus/Minus 3 Std Dev, {3 * s_sdev:.2f}')

  if do_tst >= 2:
    ax.legend()
  plt.show()
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 4
0 China 2011
[85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474, 95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981, 61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912, 1518.293, 268.595, 25.722]
sample weighted variance: 411.34021578223246
sample weighted standard deviation: 20.28152400048459, square root variance: 20.28152400048459
histogram showing mean and standard deviation ranges

Sheppard’s Correction

Sheppard’s correction is used to adjust moments calculated from binned data. There are some prerequisites for its use.

  • the data is assumed to come from a distribution on a finite interval (I think that applies here)
  • that interval is divided into bins of an equal width, h, that is relatively small compared to the range of the data (again seems to apply)
  • the distribution has a continuous density function (not sure about this one, but probably)

So, let's write a function to give us Sheppard's corrected variance. To do so, we just subtract \(\frac{h^2}{12}\) from the sample variance, where \(h\) is the bin width.

def var_sheppard(var, b_wd=5):
  ''' Return Sheppard's corrected variance given the uncorrected variance and the bin width.'''
  s_correct = (b_wd**2) / 12
  sc_var = var - s_correct
  return sc_var
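
The updated test code below also calls sd_sheppard(), whose definition did not make it into the post. Based on how it is used, and on the numbers in the output further down, a minimal sketch would just take the square root of the corrected variance:

def sd_sheppard(var, b_wd=5):
  ''' Return Sheppard's corrected standard deviation given the uncorrected variance and the bin width.
      (Sketch only -- the original definition is not shown in this post.)'''
  return var_sheppard(var, b_wd=b_wd)**0.5

For our bin width of 5, the correction is \(5^2/12 \approx 2.083\), which matches the drop from 411.340 to 409.257 in the output below.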

# and modify our test section(s) accordingly, starting with the early measure calculation section

  # want to get rid of Pylance warnings
  grp_mids = []
  s_mean = 0
  s_var = 0
  s_sdev = 0
  sc_var = 0
  sc_sdev = 0

  if do_tst == 2 or do_tst == 4:
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    s_var = get_binned_var(grp_mids, p_data, s_mean)
    s_sdev = get_binned_sd(grp_mids, p_data, s_mean)
    sc_var = var_sheppard(s_var, b_wd=5)
    sc_sdev = sd_sheppard(s_var, b_wd=5)

# and test #4
  if do_tst == 4:
    print(f"sample weighted variance: {s_var}, sheppard's corrected variance: {sc_var}")
    sqr_var = s_var**0.5
    print(f"sample weighted standard deviation: {s_sdev}, square root variance: {sqr_var}, sheppard's corrected std dev: {sc_sdev}")
    plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')
    plt.axvline(s_mean + s_sdev, 0, 1, color='c', label=f'Plus/Minus 1 Std Dev: {s_sdev:.2f}')
    plt.axvline(s_mean - s_sdev, 0, 1, color='c')
    plt.axvline(s_mean + (2 * s_sdev), 0, 1, color='m', label=f'Plus/Minus 2 Std Dev: {2 * s_sdev:.2f}')
    #plt.axvline(s_mean - (2 * s_sdev), 0, 1, color='m')
    plt.axvline(s_mean + (3 * s_sdev), 0, 1, color='k', label=f'Plus/Minus 3 Std Dev: {3 * s_sdev:.2f}')

And, the result:

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 4
0 China 2011
sample weighted variance: 411.34021578223246, sheppard's corrected variance: 409.25688244889915
sample weighted standard deviation: 20.28152400048459, square root variance: 20.28152400048459, sheppard's corrected std dev: 20.23009842904624

Not much of a change. So, I think I will just use the uncorrected calculations. Now, let’s look at quartiles.

Quartiles and IQR

The median is in fact a quartile: the second. We also want the 1st and 3rd, which are used to calculate the Interquartile Range (IQR), as well as in some of our measures of skewness. I am guessing the name of this measure pretty much tells us what it represents. Quartiles reflect quarters. So the first quartile is the value below which the first 1/4 of the data sits. The second, i.e. the median, is the value below which two quarters, half, of the data would be found. And the third is the value below which 3/4 of the data sits; conversely, 1/4 of the data sits above it.

All the usual assumptions will be used. I am going to write a new function, get_wt_quartile(b_lows, b_wts, qnbr=2). It will of course look similar to the one we wrote earlier to calculate just the median. I have decided to pass in the low boundaries, as numbers, for the bins rather than the labels. I believe this further separates concerns, as the function doesn’t even need to know that it has to call get_grp_low_bdrys(). I am also going to change my test code to use this function to calculate the median as well as the other two quartiles. For now, I will also add code to compare the old median function’s output against this new one, to more or less validate the new function.

# function definition
def get_wt_quartile(b_lows, b_wts, qnbr=2):
  ''' Estimate the qnbr-th quartile (1, 2 or 3) from binned data.

      b_lows: low boundary of each bin, b_wts: bin counts/weights
  '''
  # count/weight we need to accumulate to reach the requested quartile
  trgt_val = sum(b_wts) * qnbr / 4
  c_sum = 0
  i = 0
  # find the bin in which the quartile falls
  for i, wt in enumerate(b_wts):
    if c_sum + wt < trgt_val:
      c_sum += wt
    else:
      break
  # interpolate within that bin
  need_frac = (trgt_val - c_sum) / b_wts[i]
  g_low = b_lows[i]
  g_high = b_lows[i+1]
  est_quart = g_low + ((g_high - g_low) * need_frac)
  return est_quart
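
A quick check with a small made-up set of bins (low boundaries 0, 10, 20, 30 and 40, with counts 10, 20, 40, 20 and 10, so a distribution symmetric about the middle bin):

# hypothetical bins, not our population data
toy_lows = [0, 10, 20, 30, 40]
toy_wts = [10, 20, 40, 20, 10]
print(get_wt_quartile(toy_lows, toy_wts, qnbr=1))  # 17.5
print(get_wt_quartile(toy_lows, toy_wts))          # 25.0, the median
print(get_wt_quartile(toy_lows, toy_wts, qnbr=3))  # 32.5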

# Changes in the test section

# above the tests
  grp_mids = []
  s_mean = 0
  s_var = 0
  s_sdev = 0
  sc_var = 0
  sc_sdev = 0

# moved the following from test #2 to here so available to all subsequent tests
  grp_lows = get_grp_low_bdrys(LABELS)
  grp_highs = get_grp_up_bdrys(LABELS)

  if do_tst == 2 or do_tst == 4:
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    s_var = get_binned_var(grp_mids, p_data, s_mean)
    s_sdev = get_binned_sd(grp_mids, p_data, s_mean)
    sc_var = var_sheppard(s_var, b_wd=5)
    sc_sdev = sd_sheppard(s_var, b_wd=5)

  s_median = 0
  s_q1 = 0
  s_q2 = 0
  s_q3 = 0
  iqr = 0
  
  if do_tst == 3 or do_tst == 5:
    s_median = get_wt_median(LABELS, p_data)
    s_q2 = get_wt_quartile(grp_lows, p_data)
    s_q1 = get_wt_quartile(grp_lows, p_data, qnbr=1)
    s_q3 = get_wt_quartile(grp_lows, p_data, qnbr=3)
    iqr = s_q3 - s_q1

# and the new test #5, also adjusted max_tst = 5
  if do_tst == 5:
    print(f"s_median: {s_median} == s_q2: {s_q2} -> {s_median == s_q2}")
    #plt.axvline(s_median, 0, 1, color='r', label=f'Sample Median: {s_median:.2f}')
    plt.axvline(s_q1, 0, 1, color='c', label=f'1st Quartile: {s_q1:.2f}')
    plt.axvline(s_q2, 0, 1, color='r', label=f'Sample Median (2nd Quartile): {s_q2:.2f}')
    plt.axvline(s_q3, 0, 1, color='m', label=f'3rd Quartile: {s_q3:.2f}')
    print(f"1st quartile: {s_q1}, 2nd quartile (median): {s_q2}, 3rd quartile: {s_q3}")

And, the results agree with my test functions used to produce the first post, Descriptive Statistics — Preamble.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 5
0 China 2011
s_median: 35.37638405993141 == s_q2: 35.37638405993141 -> True
1st quartile: 19.728773171436764, 2nd quartile (median): 35.37638405993141, 3rd quartile: 49.998569772279076

As mentioned in a previous post, the Interquartile Range (IQR) is the distance between the 1st and 3rd quartiles. It is the measure of dispersion that goes with the median. The IQR is also used to identify outliers in the data distribution: an outlier is any value below Q1 - (1.5 * IQR) or above Q3 + (1.5 * IQR). In our case the lower threshold is negative and not of much use. But, let’s add the upper threshold to our histogram along with the quartiles.
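
Using the quartile values printed above, 1.5 * IQR is about 45.40, so the thresholds work out to roughly 19.73 - 45.40 = -25.68 and 50.00 + 45.40 = 95.40. Something along the following lines (a sketch, not necessarily the exact code behind the plot) adds the upper threshold to test #5:

  # in test #5; sketch only, the exact code in the script may differ
  out_high = s_q3 + (1.5 * iqr)   # upper outlier threshold, roughly 95.40 here
  out_low = s_q1 - (1.5 * iqr)    # negative for this data, so not plotted
  plt.axvline(out_high, 0, 1, color='k', label=f'Upper Outlier Bound: {out_high:.2f}')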

histogram quartiles, including median, and upper outlier value

Would appear we have some potential outliers. But we should perhaps consider that the upper outlier threshold is very close to the estimated mean plus 3 standard deviations. Seems to me that there is some implied meaning in that similarity. Wish I knew a lot more about statistics.

I have not tried generating a boxplot (see Resources below). But it may be something I return to at a later date.

Time for Another Break

I had also planned to cover measures of shape and/or skewness in this post. But, I do believe it is getting a mite long and has again taken me quite some time to complete. So, I will cover those items in the next post. Still on a weekly schedule.

Resources