Time to have a look at some code. Let’s sort out calculating all those descriptive measures we talked about last time. In this post I will most look at doing it by hand. In the next one we will look at making better use of available packages, e.g. scipy. Though we will in fact use scipy in this post near the end to sort the probability density function for our data, unless I decided to do it the hard way.

Because our data is all binned, we will only look at coding functions to determine estimate measures for binned data. We’ll start with measures of centrality, move on to measures of dispersion and finish with measures of shape. Maybe should start with the last, but it tends to require that we have all the preceding ones first. So…

Let’s create a new file in the project for working on descriptive statistics at the same level as population_by_age.py. Say, population/pop_desc_stat.py.

For this post I will be using only the population data for China in 2011.

Histogram

First things first. Plot your data!! Okay, we have plotted it (bar chart or grouped bar chart). But, we haven’t yet plotted a histogram. We can use hist() to get the job done. It provides for plotting binned data. I got that to work fairly quickly, but initially just couldn’t get values on the x-axis to match the location of the bars. At least without a hack I didn’t like. Took me a while to realize that I had to provide an iterable of appropriate x-values/labels. You may recall I chose to label the left of each bar with the lowest age for that group. I didn’t bother adding the bar value annotations on hovering. Not sure they would work as currently written.

If we assume I have the bar heights in p_data, and a list of lowest age for each group in b_lbls, my first attempt looked something like this:

n, bins, patches = ax.hist(b_lbls, weights=p_data, bins=21)
first failed attempt at plotting histogram Note that none of the bars align properly with a lower age boundary. And there is no gap between the rightmost bar and the chart border?

I, without reading the documentation, messed about, finally trying bins=20. For some reason it almost worked. But the space between the last bin and the border on the left is still missing. second failed attempt at plotting histogram Finally I broke down and did some reading/research. That’s when I found out I could/should supply a range of x-values for the bins to go along with the weights (data) and labels.

n, bins, patches = ax.hist(b_lbls, weights=p_data, bins=range(0, 110, 5))

Eh, voilà! a proper looking histogram of the population data for China in 2011 As I may eventually convert this module to be used as an include in other modules, I will set up the usual test area in a if __name__ == '__main__' block. All my global variables and functions will be defined in the module/package scope. I am going to start by importing the chart/chart and database/population modules. Then I will define a global containing the default chart labels using chart.get_agrp_lbls(). And in the test area I will define a variable p_data that contains the group counts for China’s population in 2011. And, don’t forget our use of a command line variable to deterine which test to run. With a quick test (print defined variables) that looks something like:

from chart import chart
from database import population as pdb

LABELS = chart.get_agrp_lbls()

if __name__ == '__main__':
  import argparse
  max_tst = 1
  # check for test number on command line
  parser = argparse.ArgumentParser()
  # long name preceded by --, short by single -, get it as an integer so can use to access test data array
  parser.add_argument('--do_test', '-t', type=int, help=f'test number (1-{max_tst})')

  args = parser.parse_args()
  if args.do_test and (args.do_test >= 1 and args.do_test <= max_tst):
    do_tst = args.do_test
  else:
    do_tst = 1

  # want just the data, nothing else, and regardless of which test we might be running
  cr_nm = 'China'
  p_yr = '2011'
  p_data = pdb.get_1cr_years_all_v1(cr_nm, [p_yr])[p_yr]

  print(LABELS)
  print(p_data)

And the output looks like:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population/pop_desc_stat.py
['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94', '95-99', '100+']
[85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474, 95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981, 61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912, 1518.293, 268.595, 25.722]

Okay, now let’s display that histogram. I will build a small, preliminary function to do just that, pop_histogram(). But, it just generates the histogram and returns the results. The plot/subplot itself is created outside the function and the necessary variables passed as arguments. And, I have amended the test area to generate and display the histogram. How’s your code compare to mine?

def pop_histogram(fig, ax, cr_nm, p_yr, labels, wts, bins=range(0, 110, 5)):
  # Because we may wish to add other items to the plot, creation of subplot is left to the user, so we have parameters for fig and ax
  # Because this is designed to plot a country's population data from our csv file, I know that what the bins range should be
  # Probably also know what the labels should be, but...
  # cr_nm (country name), p_yr (year for population data) and wts (population counts by age group for that country and year)
  #   are going to vary based on callers needs
  plt.xlabel('Age')
  plt.ylabel('Population (1000s)')
  plt.title(f'{cr_nm}: {p_yr}')
  #fig.tight_layout()
  return ax.hist(b_lbls, weights=p_data, bins=bins)

# and in the test area I removed the print statements and added the following

  # display the histogram
  b_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in LABELS]
  fig, ax = plt.subplots(figsize=(10,6))
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, b_lbls, p_data)
  plt.show()

I am not discussing some of the other parameters available in plt.hist(). E.G. align=. I leave it to you to check out the documentation (see Resources). Now, onto measures of central tendency.

I also refer you to the previous post for comments on initial observations regarding the sample data distribution given the histogram.

Mean

Let’s start with the mean. We have one variable, age. But rather than having a whole list of individual ages, we have a count for a single “age”. Well an age range, but bear with me. If we had a count for each individual age, we would calculate a weighted mean:

$$\bar x = \frac{\sum_{i=1}^n w_{i}x_{i}}{\sum_{i=1}^n w_{i}}$$

where: (\bar x) is our sample mean,
(x_{i}) is the value of the ith variable,
(w_{i}) is the count/weight for (x_{i}), and
n is the total number of variables

So, we should be able to estimate the sample mean by choosing a suitable variable value to represent each of our 5 year wide bins. I used the middle value. Which for the first bin would be 2.5. For the second, 7.5, etc. Too keep things simple, I used 102.5 for the 100+ bin.

Let’s start by writing a function to return a list of middle bin/group values, given the list of bin labels, i.e [‘0-5’, …, ‘40-44’, …, ‘100+’]. Since we are going to be using the returned list to do arithmetic, we will return the list values as floats. I did think about “cheating” and just returning an array of the values that I knew would be in the list, but I decided to actually write code to parse the input list, maybe learn/practice something worthwhile. I also decided to add the option of including a bin width parameter to provide an alternate, perhaps more straight forward, path through the function. If the bin with is known and regular, makes life simpler.

# added function definition in main module space
def get_grp_middle(g_lbls):
  # group labels of from 'x-y' except for last which is 'x+' where x,y == some number and x < y
  #    and x(i) < x(i+1), and x(i) >= y(i-1)
  # if bin width provided use it, otherwise calculate individually for each adjacent set of group labels
  m_ages = []
  if b_width:
    hlf_wd = b_width / 2
    m_ages = [int(grp.split('-')[0]) + hlf_wd if grp[-1] != '+' else int(grp[:-1]) + hlf_wd for grp in g_lbls ]
  else:
    for i in range(len(g_lbls) - 1):
      b_low = int(g_lbls[i].split('-')[0])
      if g_lbls[i+1][-1] != '+':
        b_high = int(g_lbls[i+1].split('-')[0])
      else:
        b_high = int(g_lbls[i+1][:-1])
      b_width = b_high - b_low  
      m_ages.append(b_low + (b_width /2 ))
    # don't forget the last one  
    m_ages.append(int(g_lbls[-1][:-1]) + (b_width / 2))
  return m_ages

# added quick tests - print statements
# also updated max_tst = 2
...
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, b_lbls, p_data)

  if do_tst == 2:
    print(get_grp_middle(LABELS, b_width=5))
    print(get_grp_middle(LABELS))

  plt.show()
(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 2
[2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5, 97.5, 102.5]
[2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5, 97.5, 102.5]

Okay, now a function to calculate the mean, given the middle group values and the group weights. Should be pretty straightforward, see you back here when you’re done.

# added function definition in main module space
def get_wt_mean(g_val, g_wts):
  tot_wts = sum(g_wts)
  weighted_tot = 0
  for ndx, m_age in enumerate(g_val):
    weighted_tot += (g_wts[ndx] * m_age)
  b_avg = weighted_tot / tot_wts
  return b_avg

# modified test to calc mean and print it
  if do_tst == 2:
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    print(f"sample mean: {s_mean}")
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 2
sample mean: 35.55842554253051

Okay now let’s plot the mean on the histogram.

# modified and surrounding code in the test namespace

  # display the histogram, again regardless of which test is being run, well almost
  b_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in LABELS]
  fig, ax = plt.subplots(figsize=(10,6))
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, b_lbls, p_data)

  if do_tst == 2:
    #print(get_grp_middle(LABELS, b_width=5))
    #print(get_grp_middle(LABELS))
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    print(f"sample mean: {s_mean}")
    # add line showing mean to histogram
    plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')

  if do_tst >= 2:
    ax.legend()
  plt.show()

You did remember to add a legend indicating what the line represented and to round it a little. I am not adding an image of the modified plot. Please refer to the previous post. And, now what about the limits on our estimated mean? As discussed in that previous post it is pretty straightforward. But, the long way is to calculate the mean using both the lower and upper bounds for each group rather than the middle (average) value. So, a couple more functions, a few more calculationa and lines added to the plot.

There are a couple of modifications I made to that low/high boundary value. For the low boundary I did not use 0 (zero). I used 1 hour. No one is 0 years old. For the upper boundary I used the low age of the next group. For the last I just used the age plus the width of the previous group. All seemed reasonable to me. Again because we will be doing arithmetic with these values I am returning them as numbers not strings.

# the two new functions
def get_grp_low_bdrys(g_lbls):
  # group labels of from 'x-y' except for last which is 'x+' where x,y == some number and x < y
  #    and x(i) < x(i+1), and x(i) >= y(i-1)
  g_low = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in g_lbls ]
  # set lowest boundary to 1 hour of age, not zero, an assumption on my part
  g_low[0] = 1 / (365 * 24)
  return g_low


def get_grp_up_bdrys(g_lbls):
  # width of 2nd last bar
  b_width = int(g_lbls[-1][:-1]) - int(g_lbls[-2].split('-')[0])
  # add low age in 2nd to last group to list
  ub_ages = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in g_lbls[1:]]
  # add last boundary to list
  ub_ages.append(int(g_lbls[-1][:-1]) + b_width)
  return ub_ages

# print out estimated mean and limits, add to plot of histogram
  if do_tst == 2:
    grp_mids = get_grp_middle(LABELS, b_width=5)
    s_mean = get_wt_mean(grp_mids, p_data)
    # add line showing mean to histogram
    plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')

    grp_lows = get_grp_low_bdrys(LABELS)
    grp_highs = get_grp_up_bdrys(LABELS)
    mean_low = get_wt_mean(grp_lows, p_data)
    mean_high = get_wt_mean(grp_highs, p_data)
    print(f"estimated sample mean: {s_mean}, mean lower bound: {mean_low}, mean upper bound: {mean_high}")
    # add lines showing estimated mean limits
    plt.axvline(mean_low, 0, 1, color='k', label=f'Mean Lower Bound: {mean_low:.2f}')
    plt.axvline(mean_high, 0, 1, color='m', label=f'Mean Upper Bound: {mean_high:.2f}')

  if do_tst >= 2:
    ax.legend()
  plt.show()

Again, see previous post for image of resulting plot. But here’s the printed values.

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/pop_desc_stat.py -t 2
estimated sample mean: 35.55842554253051, mean lower bound: 33.05843264708349, mean upper bound: 38.058425542530514

Lots of code for something SciPy would likely give us in a line or two! And, we’re not done just yet.

Median

The median of a data distribution is the value that has exactly half the data values below it and half the data values above it. Again, because we do not have every single data value in our sample, but rather binned values, we need to do some estimating.

There isn’t a formula as such that we can use. With a bunch of individual data points, you’d sort them and select the middle value. Well, the middle value if you have an odd number of data values. The arithmetic mean of the two central values if you have an even number of data points. Our data is sorted in a fashion. But it is also binned. So we need to total our bin weights (in one direction or another) until adding a bin will exceed half the total weights. We then have to estimate what portion of that last bin is needed to get half the total weights. Then use that to interpolate the median age. Sounds more complicated than it is.

Do recall that we are assuming an even distribution of population for each age in the group. Something that is likely not correct. But, for our purposes should be an acceptable assumption.

Give it a try in a new function, get_wt_median(). While you’re at it add a new test section (do_test == 3), and generate a histogram with the median plotted on it (similar to what we did with the mean). Note, I am going to add bins lowest age to highest.

I have also added another function in the interest of the separation of concerns. I was using code to pull the lower age from the group label within get_wt_median(). Seems to me, a function to calculate the median really shouldn’t know anything about the specifics of the group labels. So, get_low_grp_bound() was born.

def get_low_grp_bound(g_lbl):
  # because likely being used in calculation return float
  if g_lbl[-1] == '+':
    return float(g_lbl[:-1])
  else:
    return float(g_lbl.split('-')[0])


def get_wt_median(g_lbls, g_wts):
  trgt_val = sum(g_wts) / 2
  c_sum = 0
  i = 0
  for i, wt in enumerate(g_wts):
    if c_sum + wt < trgt_val:
      c_sum += wt
    else:
      break
  need_frac = (trgt_val - c_sum) / g_wts[i]
  # unlikely to be the last bin, so ignoring possibility of '100+' as bin label
  g_low = get_low_grp_bound(g_lbls[i])
  g_high = get_low_grp_bound(g_lbls[i+1])
  est_median = g_low + ((g_high - g_low) * need_frac)
  return est_median


# and in the test section
  if do_tst == 3:
    s_median = get_wt_median(LABELS, p_data)
    print(s_median)
    plt.axvline(s_median, 0, 1, color='r', label=f'Sample Median: {s_median:.2f}')

Upper and lower limits on the estimated median? Well a little thought should sort that out. We know that the median can not be in the 30-34 group. The total population for people aged 0 to 34.99 is less than half the total population for China in 2011. So the median has to be at the very least in the 35-39 group. It can’t be in the 40-44 group, since the total population for people 0-39.99 is significantly higher than half the annual total. So, the median must be between 35 and 40. That is within the 35-39 age group. And that is true regardless of how the population is distributed by age within that group. For that reason I didn’t plot the lower and upper limits on the histogram.

Time for a Break

I had planned to cover measures of dispersion and skewness in this post. But, I do believe it is getting a might long and has again taken me quite some time to complete. So, I will cover those items in the next post. Still on a weekly schedule.

Resources