Aggregate and Statistical Functions

One of the first things most people do when looking at data is generate some basic summary statistics. We commonly think of the mean and standard deviation, but it is often also helpful to look at the minimum, maximum, median, quantiles and the like.

NumPy has a variety of aggregation functions to help us along. Let’s start with a couple of simple ones.

I will be using pandas to read a CSV file into my workbook, but for now that’s it. From the CSV data I will extract the specific data I am interested in (monthly rainfall) into a NumPy array. Note that 2014 is incomplete, so I will not be using it for most of the examples in this post. Do also note the NaN values for the first two months of 2014.

In [1]:
import numpy as np
import pandas as pd
In [2]:
rain = pd.read_csv('data/rainGauge.csv')
print(rain)
# 2014 incomplete so only use the remaining years
yrs_use = ['2015', '2016', '2017', '2018', '2019', '2020']
data = np.array(rain[yrs_use])
print('\n', data)
        Month    2014    2015    2016   2017   2018   2019    2020
0     January     NaN  253.00  215.25   99.9  319.1  202.5  460.45
1    February     NaN  129.50  202.50  111.0  161.3  110.1  144.00
2       March  174.25  225.00  203.25  387.5  145.4   52.5  103.30
3       April  116.25   59.75   35.50  210.0  199.5  151.6   38.20
4         May  143.00   21.50   57.33  142.3   11.2   53.2  120.30
5        June   41.50    5.50   73.50   54.5   81.9   24.5   73.80
6        July   26.00   21.50   35.50    2.8   15.6   58.9   37.20
7      August   14.00   78.50   22.50    7.1   11.7   34.6   45.30
8   September  115.50   54.00   84.00   58.0  189.6  191.2  115.30
9     October  201.25  142.00  260.50  198.7  121.1  155.0  145.30
10   November  277.25  302.50  350.00  273.6  329.1  121.1  261.60
11   December  187.50  278.00  170.90  215.5  339.0  302.6  408.10

 [[253.   215.25  99.9  319.1  202.5  460.45]
 [129.5  202.5  111.   161.3  110.1  144.  ]
 [225.   203.25 387.5  145.4   52.5  103.3 ]
 [ 59.75  35.5  210.   199.5  151.6   38.2 ]
 [ 21.5   57.33 142.3   11.2   53.2  120.3 ]
 [  5.5   73.5   54.5   81.9   24.5   73.8 ]
 [ 21.5   35.5    2.8   15.6   58.9   37.2 ]
 [ 78.5   22.5    7.1   11.7   34.6   45.3 ]
 [ 54.    84.    58.   189.6  191.2  115.3 ]
 [142.   260.5  198.7  121.1  155.   145.3 ]
 [302.5  350.   273.6  329.1  121.1  261.6 ]
 [278.   170.9  215.5  339.   302.6  408.1 ]]

Minimum, Maximum and Sum

Vector/1-Dimensional Array

Let’s begin by looking at just the first complete year (2015) of data.

In [3]:
np.set_printoptions(threshold=1000)
r_2015 = data[:, 0]
print(f"2015 monthly rainfall:\n{r_2015}")
# get total year's rainfall, compare with Python built-in sum() function
# but should always use NumPy function for speed, especially large data sets
r_tot = np.sum(r_2015)
print(f"\n2015 total rainfall: {r_tot} => {sum(r_2015)} (Python sum())")
# now the min and max
print(f"\n2015 min: {np.min(r_2015)}, 2015 max: {np.max(r_2015)}")
# in this case, we could also use the np_array class method calls
print(f"2015 min: {r_2015.min()}, 2015 max: {r_2015.max()}")
2015 monthly rainfall:
[253.   129.5  225.    59.75  21.5    5.5   21.5   78.5   54.   142.
 302.5  278.  ]

2015 total rainfall: 1570.75 => 1570.75 (Python sum())

2015 min: 5.5, 2015 max: 302.5
2015 min: 5.5, 2015 max: 302.5
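As for that comment about preferring NumPy’s sum over the Python built-in: a quick, hedged sketch of the speed difference using timeit on a synthetic array (not the rainfall data, just a stand-in large data set).

```python
import timeit

import numpy as np

# synthetic 1-D array standing in for a large data set
big = np.random.default_rng(42).random(1_000_000)

t_np = timeit.timeit(lambda: np.sum(big), number=10)
t_py = timeit.timeit(lambda: sum(big), number=10)

print(f"np.sum: {t_np:.4f}s, built-in sum: {t_py:.4f}s")
# both give (essentially) the same answer, NumPy just gets there much faster
print(f"same result? {np.isclose(np.sum(big), sum(big))}")
```

On my understanding, the gap comes from np.sum looping in compiled code over the raw buffer, while the built-in sum iterates element by element through Python objects.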

Multidimensional Array

Well, 2-D in our case. NumPy’s aggregate functions work just fine with multidimensional arrays. So, let’s have a shot at the full 72 months.

In [4]:
print("72 Months Inclusive:")
print(f"\tmin: {data.min()}, max: {data.max()}")
# how about the total for the 72 months
print(f"\ttotal: {data.sum():.2f}")
72 Months Inclusive:
	min: 2.8, max: 460.45
	total: 10377.53

Multi-Dimensional Array Along a Specific Axis

But we can also have NumPy do this by the years or months in the data array. We just need to ask NumPy to collapse the data along a specific axis. In our case the columns are the years, so to get the min for each year we collapse the data along axis 0; to get it for each month, along axis 1.

In [5]:
print("By year:")
print(f"\tmin: {data.min(axis=0)}\n\tmax: {data.max(axis=0)}\n\ttot:{data.sum(axis=0)}")
By year:
	min: [   5.5    22.5     2.8    11.2    24.5    37.2 ]
	max: [ 302.5   350.    387.5   339.    302.6   460.45 ]
	tot:[ 1570.75 1710.73 1760.9  1924.5  1457.8  1952.85 ]
In [6]:
print("By month:")
print(f"\tmin: {data.min(axis=1)}\n\tmax: {data.max(axis=1)}\n\ttot:{data.sum(axis=1)}")
# though the total for each month across the years is likely not of much real value
By month:
	min: [ 99.9  110.1   52.5   35.5   11.2    5.5   2.8   7.1  54.   121.1  121.1  170.9 ]
	max: [460.45 202.5  387.5  210.   142.3   81.9  58.9  78.5 191.2  260.5  350.   408.1 ]
	tot: [1550.2 858.4 1116.95 694.55 405.83 313.7 171.5 199.7 692.1 1022.6 1637.9 1714.1 ]

Some of the Available Functions

Do note that in data science we must be aware of missing data and its consequences. NumPy provides NaN-safe counterparts to many of its functions (check the documentation for which ones are available in your version of NumPy). Missing values need to be identified by the special floating-point value NaN.

Function       NaN-safe          Description
np.min         np.nanmin         Minimum value
np.max         np.nanmax         Maximum value
np.sum         np.nansum         Sum of all elements along specified axis/axes
np.mean        np.nanmean        Mean of elements along specified axis/axes
np.std         np.nanstd         Standard deviation
np.var         np.nanvar         Variance
np.median      np.nanmedian      Median of specified elements
np.percentile  np.nanpercentile  Rank-based statistics of specified elements
np.any         n/a               Are any elements true
np.all         n/a               Are all elements true
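A quick sketch of a few of these on the incomplete 2014 column (values hard-coded from the table above so the snippet stands alone): the plain functions propagate NaN, while the NaN-safe variants simply ignore the missing months.

```python
import numpy as np

# 2014 monthly rainfall from the table above; Jan/Feb are missing
r_2014 = np.array([np.nan, np.nan, 174.25, 116.25, 143.0, 41.5,
                   26.0, 14.0, 115.5, 201.25, 277.25, 187.5])

print(np.any(np.isnan(r_2014)))     # True: at least one value is missing
print(np.sum(r_2014))               # nan: the plain sum propagates NaN
print(np.nansum(r_2014))            # 1296.5: NaN-safe sum skips the NaNs
print(f"{np.nanmean(r_2014):.2f}")  # 129.65: mean over the 10 known months
```

Note that np.nanmean divides by the count of non-NaN values (10 here), not by 12.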

Example: Rainfall Statistics

The whole point of these functions is to allow us to summarize our data set. So, let’s have a look.

In [7]:
# Let's start with the mean, standard deviation and such by month
np.set_printoptions(threshold=6, edgeitems=4)
print(f"Mean: {data.mean(axis=1)}")
print(f"Std Dev: {data.std(axis=1)}")
print(f"Min: {data.min(axis=1)}")
print(f"Max: {data.max(axis=1)}")
print(f"Median: {np.median(data, axis=1)}")
print(f"25th Percentile: {np.percentile(data, 25, axis=1)}")
print(f"75th Percentile: {np.percentile(data, 75, axis=1)}")
Mean: [258.37 143.07 186.16 115.76 ... 115.35 170.43 272.98 285.68]
Std Dev: [111.54  32.05 107.03  73.91 ...  56.72  46.59  74.33  77.75]
Min: [ 99.9 110.1  52.5  35.5 ...  54.  121.1 121.1 170.9]
Max: [460.45 202.5  387.5  210.   ... 191.2  260.5  350.   408.1 ]
Median: [234.12 136.75 174.32 105.67 ...  99.65 150.15 288.05 290.3 ]
25th Percentile: [205.69 115.62 113.83  43.59 ...  64.5  142.82 264.6  231.12]
75th Percentile: [302.58 156.98 219.56 187.53 ... 171.03 187.77 322.45 329.9 ]

How about the average of the yearly totals? I am hoping I can string together some class methods and get this done on one line. What do you think?

In [8]:
print(f"Mean annual total: {data.sum(axis=0).mean():.2f}")
Mean annual total: 1729.59

This matches my calculation using the annual totals we generated earlier (see In [5]).
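As a sanity check on the chaining, the mean of the per-column totals is just the grand total divided by the number of columns. A tiny stand-in array (not the rainfall data) makes the equivalence easy to see:

```python
import numpy as np

# stand-in array: 3 "months" (rows) by 2 "years" (columns)
d = np.array([[10.0, 20.0],
              [30.0, 40.0],
              [50.0, 60.0]])

# mean of per-year (column) totals == grand total / number of years
chained = d.sum(axis=0).mean()
direct = d.sum() / d.shape[1]
print(chained, direct)  # 105.0 105.0
```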

Of course, plotting the data, as previously discussed, is also important in our initial analysis and data summarization. We’ll get into that when we discuss matplotlib.

NaN vs Zero

Before we call it a day, perhaps a quick look at NaN using the data for 2014.

In [9]:
np.set_printoptions(threshold=1000)
miss_data = np.array(rain["2014"])
print(miss_data)
print("\n2014 data (missing two months as NaN):")
print(f"\tsum (numpy, NaN): {miss_data.sum():.2f}")
print(f"\tmean 2014 (numpy, NaN): {miss_data.mean():.2f}")
[   nan    nan 174.25 116.25 143.    41.5   26.    14.   115.5  201.25
 277.25 187.5 ]

2014 data (missing two months as NaN):
	sum (numpy, NaN): nan
	mean 2014 (numpy, NaN): nan

Now let’s replace both NaN with 0. And see what happens.

In [10]:
miss_data[0:2] = 0
print(miss_data)
print("\n2014 data (missing two months as 0):")
print(f"\tsum (numpy, zeroes): {miss_data.sum():.2f}")
# and the mean
print(f"\tmean 2014 (numpy, zeroes): {miss_data.mean():.2f}")
# which is equivalent to
print(f"\tmean 2014 (calc, sum/12): {(miss_data.sum() / 12):.2f}")
[  0.     0.   174.25 116.25 143.    41.5   26.    14.   115.5  201.25
 277.25 187.5 ]

2014 data (missing two months as 0):
	sum (numpy, zeroes): 1296.50
	mean 2014 (numpy, zeroes): 108.04
	mean 2014 (calc, sum/12): 108.04

Which to use will depend a lot on the situation. But in general I expect NaN will be the better choice, as it lets us know when we are missing data, which could be rather important for the calculation in question.
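A third option worth a quick sketch: rather than zero-filling, keep the NaNs and use a boolean mask to compute over only the months we actually have. The result matches np.nanmean and differs noticeably from the zero-filled 108.04 above.

```python
import numpy as np

# 2014 data as above, with Jan/Feb missing
miss_data = np.array([np.nan, np.nan, 174.25, 116.25, 143.0, 41.5,
                      26.0, 14.0, 115.5, 201.25, 277.25, 187.5])

# boolean mask selecting only the months with real data
known = ~np.isnan(miss_data)
print(known.sum())                       # 10 months of real data
print(f"{miss_data[known].mean():.2f}")  # 129.65, vs 108.04 with zero-fill
print(f"{np.nanmean(miss_data):.2f}")    # 129.65, the NaN-safe mean agrees
```

The mask makes the denominator explicit: we are averaging over 10 months, not 12.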

Break Time

I think that’s it for this one. See you next time. Sorry, lots more NumPy to come.

Feel free to download my notebook covering the above and play around.

Resources