Average, Range, Shape…

Well, I wanted to do something with that population data other than plot it. Though pretty much every statistics article says that the first thing you should do is plot your data. Look at the shape, consider the spread, the consistency and validity of the data, etc.

So, even though we don’t have data for every single age, I am still going to try and determine the world’s average age for any given year in the data. I know we could just determine all the countries (i.e. excluding regions) in the data, add stuff up, make some assumptions and work out an average (mean, median or mode). But, hardly any fun that. (I am ignoring that there is a “World” set of data in the CSV file.) And, I might also, given we have the data, look at comparing averages for females and males. Maybe even sort out continents (data regarding countries by continent will be needed) and compare them.

This will probably be a mite boring. And, in my case, rather undisciplined. There will also be little or no code covered in this post; I will do that in the next post. “But don’t learn if you don’t try”, or somesuch.

Summarizing Data

When it comes to looking at large quantities of data, it usually makes sense to try to “summarize” or “describe” that data in some way. Being able to look at a table showing the age and sex of every person in some country or three is hardly enlightening. Take our population age project. Having all those individual bits of information would not tell us much of anything. Our charts are a bit more informative. Even though we only have information for 5 year groups of age. Well, except for the last, which could contain a count for more than 5 years. We only know it contains a count for the number of people 100 years old or older. But…

The subject of ‘descriptive statistics’ deals with exactly this desire to summarize and/or describe a set of data. The three general areas of interest are:

distribution: a summary of the individual values or range of values for some variable (tabular and/or graphical)
central tendency: some idea of the central value of the data distribution
dispersion: how much does that data spread around the central value

Data Distribution

The simplest distribution would be a list of every variable value and a count of the number of cases with that value. In our case, for a given sample, a list of the age groups with a count or percentage of the number of people in that age group. For example:

Population of China

Age Group	2010	2011
0-4	84639.255	85667.547
5-9	82802.732	82561.492
10-14	87977.487	86558.628
15-19	99114.906	94460.802
20-24	130144.737	127982.303
25-29	101337.821	106882.474
30-34	97621.691	95303.553
35-39	121906.517	117327.199
40-44	126529.87	127924.578
45-49	102728.541	107735.466
50-54	83014.21	84559.857
55-59	83991.44	84798.981
60-64	56477.432	61047.669
65-69	39887.004	41155.955
70-74	31741.534	31836.485
75-79	21607.492	22173.432
80-84	11199.856	11853.69
85-89	4493.804	4854.912
90-94	1346.792	1518.293
95-99	224.499	268.595
100+	22.984	25.722
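The percentage form of this distribution, mentioned above, is simple to work out. A minimal sketch in Python, using the 2011 column copied from the table; the names `labels`, `counts` and `percentages` are made up for illustration and are not from the project's code.

```python
# Age-group labels and the China 2011 counts, copied from the table above
labels = [f"{lo}-{lo + 4}" for lo in range(0, 100, 5)] + ["100+"]
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]


def percentages(values):
    """Return each value as a percentage of the total."""
    total = sum(values)
    return [100 * v / total for v in values]


pcts = percentages(counts)
for label, pct in zip(labels, pcts):
    print(f"{label:>6}  {pct:5.2f}%")
```

Percentages make it easier to compare countries of very different sizes, but on their own they are still just a column of numbers.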

Lots of information, not much insight. And, that’s why it is always recommended one plot the data and have a look before getting too carried away calculating summary variables of some sort or other. So for the above data:

China population for years 2010 and 2011, all age groups

Maybe a little more information. We now know the data distribution is not symmetrical. It appears to be skewed right. And the average age is likely between 20 and 44. And, that the age groups 20-24 and 40-44 seem to have the most people. Though 35-39 is not far off either of those two groups. A little progress.

Now, we could likely, with some effort, have garnered those generalizations from the table of data. But sure was way easier to do so from the chart. Though perhaps not quite traditional, the chart is very much histogram-like. And, this of course supports the idea that one should always plot their data — no matter what they plan to do with it.

A more traditional histogram might look something like the following. Apologies if anything is missing, I just wrote some quick code using plt.hist() to produce it. This is effectively a binned histogram of the population data for China in 2011.

histogram of the population data for China in 2011
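One way to get plt.hist() (or the equivalent Axes.hist()) to draw data that is already binned is to pass one x value per bin, the bin edges, and the counts as weights. The sketch below assumes that approach; it is not the code used for the chart above. The helper names `hist_args` and `plot_binned_hist` are hypothetical, and the open-ended 100+ group is drawn as 100 to 105, which is an assumption.

```python
def hist_args(counts, width=5):
    """Build x values, bin edges and weights for a pre-binned histogram."""
    edges = [i * width for i in range(len(counts) + 1)]   # 0, 5, 10, ...
    mids = [lo + width / 2 for lo in edges[:-1]]          # one x per bin
    return mids, edges, list(counts)


def plot_binned_hist(counts, title):
    """Draw the histogram and return the matplotlib figure."""
    # Imported here so hist_args() can be used without matplotlib installed
    import matplotlib.pyplot as plt

    mids, edges, weights = hist_args(counts)
    fig, ax = plt.subplots()
    # Each midpoint falls in exactly one bin; its weight becomes the bar height
    ax.hist(mids, bins=edges, weights=weights, edgecolor="white")
    ax.set_xlabel("Age")
    ax.set_ylabel("Population (units as in the table)")
    ax.set_title(title)
    return fig
```

Calling `plot_binned_hist(counts, "China 2011")` with the 21 counts from the table and then `plt.show()` should give bars of equal width with no gaps between them.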

Average?

Don’t think anyone will disagree that when we talk about an “average” we are looking for a number that represents the centre of our data. Whether that data be a population—in the statistical sense—or a sample. In the current scenario, the world would be the population and a single country would represent a sample.

There are three common choices for describing the “average”. The mean, median and mode. Nice that they all start with the letter m.

Mode

Let’s get rid of the mode right from the start. It’s not a bad idea: the central value is the one with the greatest number of members. In our case the one with the most people in an age group. If you look at our chart, there are for China in 2011 two age groups with very similar numbers. And, that is the crux of the issue with using the mode for central tendency. Any given data sample can contain multiple modes.

In our sample for China in 2011 there are two age groups that are so close in numbers that you could hardly call one a better choice than the other. Especially given the potential errors in determining the population of each age group. We effectively have a bimodal distribution. So, mode is off the list.

Mean

When most people hear average they in fact are thinking of the arithmetic mean or just mean. Essentially you add up all the values for your sample variable and divide by the number of values in the sample. We can’t really do that as we don’t have the age for every single individual in our sample, e.g. China in 2011. But we can estimate a value for the mean using some assumptions and calculate a weighted mean. We will first assume that every age in a given group has exactly the same number of people in it.

So, if the age group 90-94 has 50,000 people in it, we assume there are 10,000 people at each of the ages 90, 91, 92, 93 and 94. But that would also be true if we looked at half years instead of full years, though the value for each such group/class would of course be 5,000. Given that each group/class is 5 years wide, the middle of the class (class mark?) would be the lowest age plus 5/2. And, that is the value we would use for each class in calculating a weighted total of ages. I am assuming a mean age of 102.5 for the 100+ age group. No need to complicate life.

You may be thinking, as I did originally, that the middle age for the 0-4 group would be 2. And it is if we look only at the year completed, but age is a more or less continuous variable, not integral. So the first group/class goes from roughly 0 (well 1 second or so, since no one living can be of 0 age) to almost 5 years. So the middle age would in fact be 2.5.

I wrote a bit of code in a test file and used the data for China 2011 to get an estimated mean age of 35.56 (after a bit of rounding). Now for each age group/class the majority of people could be at the lower end of range or the higher. So, if we use the boundary ages for each group we should be able to get a pretty good idea of the minimum and maximum values for the mean. Note: I used 1 hour as the minimum boundary for the 0-4 class. And I used the upper boundary plus 1 year minus an hour as the upper value for each class. For the 100+ class that would be just shy of 105. Assumptions, assumptions! A little more coding and we have a mean somewhere between 33.06 and 38.05.
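The calculation just described can be sketched in a few lines. This is an illustration under the stated assumptions (equal spread within each class, 100+ treated as 100 to 105), not the test file mentioned above; `weighted_mean` and the other names are made up for the example, and the bounds use the exact class edges rather than the one-hour offsets.

```python
# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]
WIDTH = 5  # every class treated as 5 years wide, including 100+ (assumption)


def weighted_mean(values, weights):
    """Mean of values where values[i] counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


lows = [i * WIDTH for i in range(len(counts))]   # 0, 5, ..., 100
mids = [lo + WIDTH / 2 for lo in lows]           # 2.5, 7.5, ..., 102.5
highs = [lo + WIDTH for lo in lows]              # 5, 10, ..., 105

mean_mid = weighted_mean(mids, counts)    # everyone at the class mark
mean_low = weighted_mean(lows, counts)    # everyone at the bottom of the class
mean_high = weighted_mean(highs, counts)  # everyone at the top of the class

print(f"{mean_low:.2f} <= {mean_mid:.2f} <= {mean_high:.2f}")
```

With the table's numbers the middle value rounds to 35.56, and the two bounds land within a hundredth or so of the 33.06 and 38.05 quoted above.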

histogram of China population in 2011 showing the estimated mean age and its upper and lower boundaries

Now, you may have realized, looking at the arithmetic, that as we were using the middle value of bins of equal width, if we used the low value for the bin the mean would be 1/2 a bin width less. Similarly if we used the upper bin value the maximum for the mean would be 1/2 a bin width more. And saved yourself some extra coding.
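The arithmetic behind that shortcut, written out. The symbols are introduced here for illustration only: n_i is the count in class i, m_i its class mark, and h the common bin width (5 in our case).

```latex
\bar{x}_{\text{mid}} = \frac{\sum_i n_i m_i}{\sum_i n_i},
\qquad
\bar{x}_{\text{low}}
  = \frac{\sum_i n_i \left(m_i - \tfrac{h}{2}\right)}{\sum_i n_i}
  = \frac{\sum_i n_i m_i}{\sum_i n_i} - \frac{h}{2}\,\frac{\sum_i n_i}{\sum_i n_i}
  = \bar{x}_{\text{mid}} - \frac{h}{2}
```

The same steps with a plus sign give the upper bound. With h = 5 and a mean of 35.56 that is 35.56 - 2.5 = 33.06 on the low side, matching the coded result.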

Does the mean look like a reasonable estimate of the data distribution’s central tendency? In general, the mean is not considered a good measure of centrality for skewed distributions. Let’s look at an exaggerated example.

Let’s say we have collected the annual income from a small sample of people. In our case, there are two people whose income is many times more than that of all the others. The remaining incomes are not horribly out of whack in any which way. Though certainly more samples would have helped. The data distribution, with the calculated mean, is shown below.

histogram of a very skewed sample of incomes, arithmetic mean is shown

Of a sample of 30 only 10 people have an income greater than the mean. Though exaggerated this is why the mean is not normally used to describe the centre of a skewed distribution. Or one with notable outliers. Let’s look at what is.

Median

The median is the value that has half the sample entries below it and the other half above it. In our case, we want the age where 50% of the sample is younger and 50% of the sample is older. Rather hard to tell just by looking. But, I’m thinking it would be closer to 40 than to 35. I expect the median to be higher than the mean because the chart is skewed right. With fewer older individuals, the arithmetic mean should end up further left than what might be considered the central tendency of our data distribution.

Again, I wrote (well re-wrote) some test code to sort that out. We will use the same assumptions we used for the mean. Primarily that in each age group every included age has an equal number of people. So, we start adding age group population counts until adding another would exceed half the total population for that country and year. We then determine how many more people are required to get half the population, and use that to interpolate the median age. I calculated a median of 35.38 (I originally miscalculated it as 40.35).
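The accumulate-then-interpolate steps just described might look something like this. A sketch only, assuming 5-year classes starting at age 0 and an even spread of people within each class; the function name `binned_median` is hypothetical.

```python
# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]


def binned_median(counts, width=5):
    """Estimate the median of binned data by linear interpolation."""
    half = sum(counts) / 2
    running = 0.0
    for i, count in enumerate(counts):
        # Stop at the first bin whose count takes us to or past the halfway mark
        if count > 0 and running + count >= half:
            lower_edge = i * width
            # Fraction of this bin still needed to reach half the population
            return lower_edge + width * (half - running) / count
        running += count
    raise ValueError("counts must contain at least one positive value")


median = binned_median(counts)
print(f"estimated median age: {median:.2f}")
```

For the China 2011 counts the loop stops in the 35-39 class, and the interpolated value rounds to the 35.38 given above.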

histogram of China population in 2011 showing the estimated median age

I have not shown the minimum and maximum values for the possible median. But, we do know that the median must be in the bin containing our estimated median. There aren’t enough bodies in the first 7 bins to add up to 1/2 the total population. And it can’t be in a later bin, because adding this bin’s population total puts us over the halfway mark. If the distribution of ages in the bin is skewed right, the median will be lower. If skewed left it will be higher. But it must be greater than or equal to 35 and less than 40.

So which average should we use for our case? A little tough to tell from the one sample. Especially since the mean and median are virtually identical. However, most census bureaus use the median when providing averages for a variety of population related data. But, let’s look at the next item on our list of descriptive statistics: dispersion.

Dispersion

When speaking about dispersion we are looking at describing or measuring the spread of our sample’s data values around the centre or the estimated measure of centrality. The first one that might come to mind is the range. That is the difference between the lowest and highest values in our data sample. Another you might be aware of is standard deviation. A measure that is specific to the mean, and therefore, in general, symmetric data distributions. It is frequently mentioned, in one way or another, in news articles mentioning descriptive statistics of some sort. One you are perhaps not acquainted with is the interquartile range. The latter is used when discussing dispersion related to the median. Let’s have a ‘quick’ look at each.

Range

In our case we don’t have an exact value for either the youngest person or the oldest in our sample. We do know that the lower end of our range is zero. Or more precisely some age greater than 0, and less than 5. All we know about the upper end is that the oldest age is at least 100. If we do a bit of a search, we will find that the oldest known person still living is just over 117 years old. The oldest known person ever was about 122.5 years old when she passed away. So, we can be pretty certain our range lies somewhere between 100 and 123 (i.e. the oldest age minus the youngest). But, we don’t know what it is for sure. And that’s pretty much going to be the range for virtually every country in the world, given we are talking about a population sample’s ages.

Standard Deviation

There are a number of approaches to measuring the spread of data around the mean. One might think that an obvious one would be to add up, for every data point, the difference from the mean and divide by the number of data points. Well, that isn’t really going to help. That value is always going to be 0. The negative differences are equal to the positive differences (if we ignore potential cumulative arithmetic error). That is in fact the whole point of the arithmetic mean. It is meant to be the value that would make that very sum of negative and positive differences equal zero.
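Why that sum is always zero takes only one line of algebra. Here x_i are the n data values and x-bar is their mean (notation introduced just for this note):

```latex
\sum_{i=1}^{n} \left(x_i - \bar{x}\right)
  = \sum_{i=1}^{n} x_i - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```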

The most common way to eliminate the negative differences in the preceding calculation is to square all the differences before summing them and dividing by (n-1), where n is the number of values in our data set. That gives the variance; the standard deviation is its square root. (Sorry don’t yet understand why n-1, rather than n. Has something to do with sample versus population I believe.) This is a bit of a tough calculation for binned data like ours. But it is possible. See: Standard deviation of binned observations, specifically the section covering Sheppard’s corrections and Maximum Likelihood Estimates.
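For binned data, one rough approach is to treat everyone in a class as sitting at the class mark, compute the weighted variance, and then apply Sheppard's correction, which subtracts h²/12 (h being the bin width) from the variance. The sketch below does that. It divides by n rather than n-1, which makes no visible difference with counts this large, and it is an illustration rather than the method used for this post.

```python
import math

# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]
WIDTH = 5  # bin width h; the 100+ class is treated as 100-105 (assumption)
mids = [i * WIDTH + WIDTH / 2 for i in range(len(counts))]

total = sum(counts)
mean = sum(m * c for m, c in zip(mids, counts)) / total

# Weighted variance using the class marks
variance = sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / total

# Sheppard's correction for grouping into bins of width h
variance_sheppard = variance - WIDTH ** 2 / 12

sd = math.sqrt(variance)
sd_sheppard = math.sqrt(variance_sheppard)
print(f"sd: {sd:.2f}, with Sheppard's correction: {sd_sheppard:.2f}")
```

Both values come out a little over 20 years, so the correction matters much less than the assumptions about the 100+ class and the spread within each bin.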

Interquartile Range

Quartiles divide a data set into four groups with equal counts. This is a set of quartile marks, (Q1, Q2, Q3). Where Q2 would of course be the median. The interquartile range (IQR) is the distance between the first and third quartile marks. In our case, the quartiles are: (19.73, 35.38, 50.0). There is also something known as the five-number summary which adds the minimum and maximum data value to the set of quartiles. Something we really can’t do.
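The quartiles can be estimated the same way as the median: accumulate counts until the target fraction is reached, then interpolate inside that bin. A sketch under the usual assumptions (5-year classes from age 0, even spread within a class); `binned_quantile` is a hypothetical helper name.

```python
# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]


def binned_quantile(counts, q, width=5):
    """Estimate the q-th quantile (0 < q < 1) of binned data."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    target = q * sum(counts)
    running = 0.0
    for i, count in enumerate(counts):
        if count > 0 and running + count >= target:
            # Interpolate linearly within the bin that holds the target
            return i * width + width * (target - running) / count
        running += count
    raise ValueError("counts must contain at least one positive value")


q1, q2, q3 = (binned_quantile(counts, q) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1
print(f"Q1={q1:.2f}  Q2={q2:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```

Q3 lands a hair under 50 because the cumulative count through the 45-49 class is almost exactly three quarters of the total.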

But a little more coding and I get the following.

histogram of China population in 2011 showing the estimated quartiles

Looking at the numbers we have:

width Q1: 19.73
width Q2: 15.65
width Q3: 14.62
width Q4: > 50.0

Those numbers would seem to imply some skewness in our data sample. Though the mean and median being virtually identical would not.

There is another way the IQR is used. Specifically to define outliers, that is any values significantly below Q1 or above Q3. The accepted arithmetic says an outlier is any value below Q1 - (1.5 * IQR) or above Q3 + (1.5 * IQR). For this data distribution those numbers are: -25.68 and 95.40. So looks like we have some outliers. A good number of the people in the 95-99 bin and all of the people in the 100+ bin.
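That fence arithmetic as a small function. The quartile values are the rounded ones quoted earlier, so the result can differ from the figures above in the last digit; `outlier_fences` is a name made up for this sketch.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles for China 2011, as quoted in the text
lower, upper = outlier_fences(19.73, 50.0)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Since no age can be negative, only the upper fence can ever flag anyone in this data.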

Measures of Shape

There are two common measures of shape: skewness and kurtosis (see resources below for links). Skewness is a measure indicating the lack of symmetry in a data distribution. I say lack because a symmetrical distribution typically has a skewness of 0. So a value other than 0, indicates the degree and possibly the direction of skewness in the distribution. Kurtosis attempts to measure the degree to which a data distribution has outliers relative to a normal distribution. For the present I am not going to look at kurtosis.

There are a few ‘common’ measures for skewness. I am going to look at a couple of them. But do remember, a histogram of your data will give a real quick and visual look at both skewness and kurtosis.

Adjusted Fisher-Pearson Coefficient of Skewness

It “expresses skewness in terms of the ratio of the third cumulant κ3 to the 1.5th power of the second cumulant κ2.” Yeah, I know! (see: Wikipedia’s article on Skewness) This is the workhorse of skewness measures.

The value I got for this measure is 0.234110 (after a bit of rounding)
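In more practical terms, the coefficient is the third central moment divided by the second central moment raised to the power 1.5, multiplied by a small-sample adjustment of sqrt(n(n-1))/(n-2). A sketch for binned data follows, using class marks and treating the total of the counts as n (an assumption; with counts this large the adjustment factor is essentially 1). The helper name `central_moment` is hypothetical.

```python
import math

# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]
WIDTH = 5
mids = [i * WIDTH + WIDTH / 2 for i in range(len(counts))]


def central_moment(values, weights, k, mean):
    """Weighted k-th central moment about the given mean."""
    return sum(w * (v - mean) ** k for v, w in zip(values, weights)) / sum(weights)


n = sum(counts)
mean = sum(m * c for m, c in zip(mids, counts)) / n
m2 = central_moment(mids, counts, 2, mean)
m3 = central_moment(mids, counts, 3, mean)

g1 = m3 / m2 ** 1.5                               # Fisher-Pearson coefficient
adjusted = math.sqrt(n * (n - 1)) / (n - 2) * g1  # adjusted version

print(f"adjusted Fisher-Pearson skewness: {adjusted:.4f}")
```

Run against the table's 2011 column, this reproduces the value quoted above to about four decimal places.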

Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. So we are skewed right, which appears to be correct. On page 63, of ‘Principles of Statistics’, Bulmer, M. G., 1979, Dover, Bulmer suggests that:

If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
If skewness is between −½ and +½, the distribution is approximately symmetric.

Our number would imply this sample data distribution is approximately symmetric. Though looking at the histogram it certainly seems lightly skewed right.

Pearson’s second skewness coefficient (median skewness)

A multiple of the nonparametric skew. Not the most precise, but relatively easy to calculate. (See: Pearson’s Coefficient of Skewness). Its values are in the interval -3 to +3. This coefficient compares the sample distribution to a normal distribution. The bigger the absolute value, the more the distribution differs from a normal distribution. Again, 0 implies no skewness. A negative value says the mean is to the left of the median, a positive value the opposite.

The value I got for this measure is 0.026996 (after a bit of rounding). Which being so small seems to imply a symmetrical distribution. Which would agree with our values for the mean and median.
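The formula is 3 × (mean − median) / standard deviation. A sketch for the binned data follows. It uses class marks for the mean, an interpolated median, and a standard deviation with Sheppard's correction applied; that last choice is an assumption made here, and the uncorrected value gives nearly the same answer. The name `binned_median` is hypothetical.

```python
import math

# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]
WIDTH = 5
mids = [i * WIDTH + WIDTH / 2 for i in range(len(counts))]
total = sum(counts)


def binned_median(counts, width):
    """Interpolated median of binned data (even spread within each bin)."""
    half, running = sum(counts) / 2, 0.0
    for i, count in enumerate(counts):
        if count > 0 and running + count >= half:
            return i * width + width * (half - running) / count
        running += count
    raise ValueError("counts must contain at least one positive value")


mean = sum(m * c for m, c in zip(mids, counts)) / total
median = binned_median(counts, WIDTH)
variance = sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / total
sd = math.sqrt(variance - WIDTH ** 2 / 12)  # Sheppard-corrected (assumption)

pearson2 = 3 * (mean - median) / sd
print(f"Pearson's second skewness coefficient: {pearson2:.4f}")
```

The result is a small positive number, because the estimated mean sits just above the estimated median.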

Bowley Skewness Formula

This is a quartile based measure. It generates a value between -1 and +1. With the sign indicating the direction of skewness.

The value I got for this measure is -0.033876 (after a bit of rounding). Which implies the distribution is skewed left (negatively). But the value is so small, that again we are likely looking at a symmetrical distribution.
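The Bowley formula is (Q3 + Q1 − 2 × Q2) / (Q3 − Q1). A tiny sketch using the rounded quartiles quoted earlier; because of that rounding the result differs slightly from the figure above in the later decimals. The function name is made up for the example.

```python
def bowley_skewness(q1, q2, q3):
    """Quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1), within [-1, 1]."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Rounded quartiles for China 2011, as quoted in the text
sk = bowley_skewness(19.73, 35.38, 50.0)
print(f"Bowley skewness: {sk:.4f}")
```

The numerator is just (Q3 − Q2) − (Q2 − Q1), so the sign only says which side of the median has the wider quartile.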

One article I looked at, “Skewness Introduction, formula, Interpretation” said that if Mean > Median > Mode then the distribution is positively skewed (i.e. skewed right). And, if Mode > Median > Mean then it is negatively skewed (i.e. skewed left). Ours looks positively skewed. Yet the median is virtually identical to the mean implying our distribution is symmetrical. The two potential modes are either side of the mean and median. Though not sure if that really implies anything.

In “Symmetric And Skewed Data” we are told that a distribution that is skewed right (positively) has the following properties:

the mean is typically greater than the median
the tail of the distribution is longer on the right hand side
the median is closer to the first quartile than to the third

Looking at the quartiles, median - q1 = 15.65 and q3 - median = 14.62. So the median is a little closer to Q3 which would imply a slight negative skewness. But, of our skewness measures, only the quartile-based Bowley value agrees, and it is tiny.

No wonder I always found statistics baffling! I have no idea how a bimodal distribution might affect these values or their interpretation. More research needed.

That’s It For This One

This post has taken me quite some time to complete. Quite a bit more than I expected. So, I am going to stick to my weekly, every Monday, posting schedule for now. Until next time.

Not Quite!

Well there you go, not quite done. Decided to plot the estimated normal probability density function for our binned data — with help from SciPy. Got a bit of a surprise.

histogram of China 2011 population with plot estimated probability density function overlaid

Look at that tail to the right. Almost a perfect match to the PDF. Not so much to the left side. So perhaps the distribution is indeed skewed left. Or would be if you could have ages below zero.
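A curve like that can be approximated without SciPy. The sketch below swaps in the standard library's statistics.NormalDist and fits it with the binned mean and standard deviation (class marks, no Sheppard correction); both choices are assumptions and may not match what was actually plotted. The expected count for a class is the total population times the class width times the density at the class mark.

```python
import math
from statistics import NormalDist

# China 2011 counts per 5-year age group (0-4 ... 95-99, 100+), from the table
counts = [
    85667.547, 82561.492, 86558.628, 94460.802, 127982.303, 106882.474,
    95303.553, 117327.199, 127924.578, 107735.466, 84559.857, 84798.981,
    61047.669, 41155.955, 31836.485, 22173.432, 11853.69, 4854.912,
    1518.293, 268.595, 25.722,
]
WIDTH = 5
mids = [i * WIDTH + WIDTH / 2 for i in range(len(counts))]

total = sum(counts)
mean = sum(m * c for m, c in zip(mids, counts)) / total
sd = math.sqrt(sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / total)

normal = NormalDist(mu=mean, sigma=sd)

# Scale the density so it is comparable with the bar heights of the histogram
expected = [total * WIDTH * normal.pdf(m) for m in mids]

for m, seen, fit in zip(mids, counts, expected):
    print(f"{m:6.1f}  observed {seen:12.1f}  normal {fit:12.1f}")
```

Comparing the two columns shows where the fitted curve sits above or below the bars; the curve also assigns some density to negative ages, which the real data cannot have.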

However, this is a sample of one country. Hardly conclusive. And one with a very large population — which might eliminate the kind of variances we might see in a smaller sample. As might the political situation. Take for instance the many years of the one-child and two-child policies. Maybe I should have used a different country for this coverage of descriptive statistics. But, a little late now. Those policies might also explain the dip in the population aged 25-40. I will continue to use China 2011 for the next few posts covering the code so we can compare function outputs to this one.

At some point, once we get the code for calculating our measures sorted, we will randomly sample a few countries, generate our descriptive statistics and check for any significant skewness. That is probably at least a couple of posts away. But…

Resources

Postscript

To make life simpler I wrote a small function to generate the HTML for the table at the top of this post. I used it to print the HTML out to the terminal window, copied and pasted in the post. Here’s the code.

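def data_table(cr_nm, years):
  """Return an HTML table of population by age group for one country.

  cr_nm is the country/region name; years is a list of years, one column each.
  """
  # pdb and chart are presumably this project's own modules
  # (pdb here is not Python's built-in debugger)
  dbg_data = pdb.get_1cr_years_all(cr_nm, years)
  a_grps = chart.get_agrp_lbls()

  # header row: the age group label followed by one column per year
  d_tbl = '<table>\n'
  d_tbl += '<tr><th>Age Group</th>'
  for yr in years:
    d_tbl += f'<th>{yr}</th>'
  d_tbl += '</tr>\n'

  # one row per age group, one cell per requested year
  for i, a_grp in enumerate(a_grps):
    d_tbl += f'<tr><th>{a_grp}</th>'
    for yr in years:
      d_tbl += f'<td>{dbg_data[yr][i]}</td>'
    d_tbl += '</tr>\n'

  d_tbl += '</table>\n'

  return d_tbl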

Too Old To Code

Descriptive Statistics — Preamble