Aggregate and Statistical Functions

One of the first things most people do when looking at data is generate some basic summary statistics. We commonly think of the mean and standard deviation, but it is often also helpful to look at the minimum, maximum, median, quantiles and the like.

NumPy has a variety of aggregation functions to help us along. Let’s start with a couple of simple ones.

I will be using pandas to read a CSV file into my workbook, but for now that’s it. From the CSV data I will extract the specific data I am interested in (monthly rainfall) into a NumPy array. Note that 2014 is incomplete, so I will not be using it for most of the examples in this post. Do also note the NaN values for the first two months of 2014.

In [1]:
import numpy as np
import pandas as pd
In [2]:
rain = pd.read_csv('data/rainGauge.csv')
print(rain)
# 2014 incomplete so only use the remaining years
yrs_use = ['2015', '2016', '2017', '2018', '2019', '2020']
data = np.array(rain[yrs_use])
print('\n', data)
        Month    2014    2015    2016   2017   2018   2019    2020
0     January     NaN  253.00  215.25   99.9  319.1  202.5  460.45
1    February     NaN  129.50  202.50  111.0  161.3  110.1  144.00
2       March  174.25  225.00  203.25  387.5  145.4   52.5  103.30
3       April  116.25   59.75   35.50  210.0  199.5  151.6   38.20
4         May  143.00   21.50   57.33  142.3   11.2   53.2  120.30
5        June   41.50    5.50   73.50   54.5   81.9   24.5   73.80
6        July   26.00   21.50   35.50    2.8   15.6   58.9   37.20
7      August   14.00   78.50   22.50    7.1   11.7   34.6   45.30
8   September  115.50   54.00   84.00   58.0  189.6  191.2  115.30
9     October  201.25  142.00  260.50  198.7  121.1  155.0  145.30
10   November  277.25  302.50  350.00  273.6  329.1  121.1  261.60
11   December  187.50  278.00  170.90  215.5  339.0  302.6  408.10

 [[253.   215.25  99.9  319.1  202.5  460.45]
 [129.5  202.5  111.   161.3  110.1  144.  ]
 [225.   203.25 387.5  145.4   52.5  103.3 ]
 [ 59.75  35.5  210.   199.5  151.6   38.2 ]
 [ 21.5   57.33 142.3   11.2   53.2  120.3 ]
 [  5.5   73.5   54.5   81.9   24.5   73.8 ]
 [ 21.5   35.5    2.8   15.6   58.9   37.2 ]
 [ 78.5   22.5    7.1   11.7   34.6   45.3 ]
 [ 54.    84.    58.   189.6  191.2  115.3 ]
 [142.   260.5  198.7  121.1  155.   145.3 ]
 [302.5  350.   273.6  329.1  121.1  261.6 ]
 [278.   170.9  215.5  339.   302.6  408.1 ]]

Minimum, Maximum and Sum

Vector/1-Dimensional Array

Let’s begin by looking at just the first complete year (2015) of data.

In [3]:
np.set_printoptions(threshold=1000)
r_2015 = data[:, 0]
print(f"2015 monthly rainfall:\n{r_2015}")
# get total year's rainfall, compare with Python built-in sum() function
# but should always use NumPy function for speed, especially large data sets
r_tot = np.sum(r_2015)
print(f"\n2015 total rainfall: {r_tot} => {sum(r_2015)} (Python sum())")
# now the min and max
print(f"\n2015 min: {np.min(r_2015)}, 2015 max: {np.max(r_2015)}")
# in this case, we could also use the np_array class method calls
print(f"2015 min: {r_2015.min()}, 2015 max: {r_2015.max()}")
2015 monthly rainfall:
[253.   129.5  225.    59.75  21.5    5.5   21.5   78.5   54.   142.
 302.5  278.  ]

2015 total rainfall: 1570.75 => 1570.75 (Python sum())

2015 min: 5.5, 2015 max: 302.5
2015 min: 5.5, 2015 max: 302.5
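As for that comment about preferring NumPy’s sum over the Python built-in: a quick, hedged sketch of the speed difference using timeit on a synthetic array (not the rainfall data, just a stand-in large data set).

```python
import timeit

import numpy as np

# synthetic 1-D array standing in for a large data set
big = np.random.default_rng(42).random(1_000_000)

t_np = timeit.timeit(lambda: np.sum(big), number=10)
t_py = timeit.timeit(lambda: sum(big), number=10)

print(f"np.sum: {t_np:.4f}s, built-in sum: {t_py:.4f}s")
# both give (essentially) the same answer, NumPy just gets there much faster
print(f"same result? {np.isclose(np.sum(big), sum(big))}")
```

On my understanding, the gap comes from np.sum looping in compiled code over the raw buffer, while the built-in sum iterates element by element through Python objects.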

Multidimensional Array

Well, 2-D in our case. NumPy’s aggregate functions work just fine with multidimensional arrays. So, let’s have a shot at the full 72 months.

In [4]:
print("72 Months Inclusive:")
print(f"\tmin: {data.min()}, max: {data.max()}")
# how about the total for the 72 months
print(f"\ttotal: {data.sum():.2f}")
72 Months Inclusive:
	min: 2.8, max: 460.45
	total: 10377.53

Multi-Dimensional Array Along a Specific Axis

But we can also have NumPy do this by the years or months in the data array. We just need to ask NumPy to collapse the data along a specific axis. In our case the columns are the years, so to get the min for each year we collapse the data along axis 0; to get it for each month, along axis 1.

In [5]:
print("By year:")
print(f"\tmin: {data.min(axis=0)}\n\tmax: {data.max(axis=0)}\n\ttot:{data.sum(axis=0)}")
By year:
	min: [   5.5    22.5     2.8    11.2    24.5    37.2 ]
	max: [ 302.5   350.    387.5   339.    302.6   460.45 ]
	tot:[ 1570.75 1710.73 1760.9  1924.5  1457.8  1952.85 ]
In [6]:
print("By month:")
print(f"\tmin: {data.min(axis=1)}\n\tmax: {data.max(axis=1)}\n\ttot:{data.sum(axis=1)}")
# though the total for each month across the years is likely not of much real value
By month:
	min: [ 99.9  110.1   52.5   35.5   11.2    5.5   2.8   7.1  54.   121.1  121.1  170.9 ]
	max: [460.45 202.5  387.5  210.   142.3   81.9  58.9  78.5 191.2  260.5  350.   408.1 ]
	tot: [1550.2 858.4 1116.95 694.55 405.83 313.7 171.5 199.7 692.1 1022.6 1637.9 1714.1 ]

Some of the Available Functions

Do note that in data science we must be aware of missing data and its consequences. NumPy provides NaN-safe counterparts to many of its functions (check the documentation for which ones are available in your version of NumPy). Missing values need to be identified by the special floating-point value NaN.

Function       NaN-safe          Description
np.min         np.nanmin         Minimum value
np.max         np.nanmax         Maximum value
np.sum         np.nansum         Sum of all elements along specified axis/axes
np.mean        np.nanmean        Mean of elements along specified axis/axes
np.std         np.nanstd         Standard deviation
np.var         np.nanvar         Variance
np.median      np.nanmedian      Median of specified elements
np.percentile  np.nanpercentile  Rank-based statistics of specified elements
np.any         n/a               Are any elements true
np.all         n/a               Are all elements true
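A quick sketch of a few of these on the incomplete 2014 column (values hard-coded from the table above so the snippet stands alone): the plain functions propagate NaN, while the NaN-safe variants simply ignore the missing months.

```python
import numpy as np

# 2014 monthly rainfall from the table above; Jan/Feb are missing
r_2014 = np.array([np.nan, np.nan, 174.25, 116.25, 143.0, 41.5,
                   26.0, 14.0, 115.5, 201.25, 277.25, 187.5])

print(np.any(np.isnan(r_2014)))     # True: at least one value is missing
print(np.sum(r_2014))               # nan: the plain sum propagates NaN
print(np.nansum(r_2014))            # 1296.5: NaN-safe sum skips the NaNs
print(f"{np.nanmean(r_2014):.2f}")  # 129.65: mean over the 10 known months
```

Note that np.nanmean divides by the count of non-NaN values (10 here), not by 12.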

Example: Rainfall Statistics

The whole point of these functions is to allow us to summarize our data set. So, let’s have a look.

In [7]:
# Let's start with the mean, standard deviation and such by month
np.set_printoptions(threshold=6, edgeitems=4)
print(f"Mean: {data.mean(axis=1)}")
print(f"Std Dev: {data.std(axis=1)}")
print(f"Min: {data.min(axis=1)}")
print(f"Max: {data.max(axis=1)}")
print(f"Median: {np.median(data, axis=1)}")
print(f"25th Percentile: {np.percentile(data, 25, axis=1)}")
print(f"75th Percentile: {np.percentile(data, 75, axis=1)}")
Mean: [258.37 143.07 186.16 115.76 ... 115.35 170.43 272.98 285.68]
Std Dev: [111.54  32.05 107.03  73.91 ...  56.72  46.59  74.33  77.75]
Min: [ 99.9 110.1  52.5  35.5 ...  54.  121.1 121.1 170.9]
Max: [460.45 202.5  387.5  210.   ... 191.2  260.5  350.   408.1 ]
Median: [234.12 136.75 174.32 105.67 ...  99.65 150.15 288.05 290.3 ]
25th Percentile: [205.69 115.62 113.83  43.59 ...  64.5  142.82 264.6  231.12]
75th Percentile: [302.58 156.98 219.56 187.53 ... 171.03 187.77 322.45 329.9 ]

How about the average of the yearly totals? I am hoping I can string together some class methods and get this done on one line. What do you think?

In [8]:
print(f"Mean annual total: {data.sum(axis=0).mean():.2f}")
Mean annual total: 1729.59

This matches my calculation using the annual totals we generated earlier (see In [5]).
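As a sanity check on the chaining, the mean of the per-column totals is just the grand total divided by the number of columns. A tiny stand-in array (not the rainfall data) makes the equivalence easy to see:

```python
import numpy as np

# stand-in array: 3 "months" (rows) by 2 "years" (columns)
d = np.array([[10.0, 20.0],
              [30.0, 40.0],
              [50.0, 60.0]])

# mean of per-year (column) totals == grand total / number of years
chained = d.sum(axis=0).mean()
direct = d.sum() / d.shape[1]
print(chained, direct)  # 105.0 105.0
```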

Of course, plotting the data, as previously discussed, is also important in our initial analysis and data summarization. We’ll get into that when we discuss matplotlib.

NaN vs Zero

Before we call it a day, perhaps a quick look at NaN using the data for 2014.

In [9]:
np.set_printoptions(threshold=1000)
miss_data = np.array(rain["2014"])
print(miss_data)
print("\n2014 data (missing two months as NaN):")
print(f"\tsum (numpy, NaN): {miss_data.sum():.2f}")
print(f"\tmean 2014 (numpy, NaN): {miss_data.mean():.2f}")
[   nan    nan 174.25 116.25 143.    41.5   26.    14.   115.5  201.25
 277.25 187.5 ]

2014 data (missing two months as NaN):
	sum (numpy, NaN): nan
	mean 2014 (numpy, NaN): nan

Now let’s replace both NaN with 0. And see what happens.

In [10]:
miss_data[0:2] = 0
print(miss_data)
print("\n2014 data (missing two months as 0):")
print(f"\tsum (numpy, zeroes): {miss_data.sum():.2f}")
# and the mean
print(f"\tmean 2014 (numpy, zeroes): {miss_data.mean():.2f}")
# which is equivalent to
print(f"\tmean 2014 (calc, sum/12): {(miss_data.sum() / 12):.2f}")
[  0.     0.   174.25 116.25 143.    41.5   26.    14.   115.5  201.25
 277.25 187.5 ]

2014 data (missing two months as 0):
	sum (numpy, zeroes): 1296.50
	mean 2014 (numpy, zeroes): 108.04
	mean 2014 (calc, sum/12): 108.04

Which to use will depend a lot on the situation. But in general I expect NaN will be the better choice, as it lets us know when we are missing data, which could be rather important for the calculation in question.
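A third option worth a quick sketch: rather than zero-filling, keep the NaNs and use a boolean mask to compute over only the months we actually have. The result matches np.nanmean and differs noticeably from the zero-filled 108.04 above.

```python
import numpy as np

# 2014 data as above, with Jan/Feb missing
miss_data = np.array([np.nan, np.nan, 174.25, 116.25, 143.0, 41.5,
                      26.0, 14.0, 115.5, 201.25, 277.25, 187.5])

# boolean mask selecting only the months with real data
known = ~np.isnan(miss_data)
print(known.sum())                       # 10 months of real data
print(f"{miss_data[known].mean():.2f}")  # 129.65, vs 108.04 with zero-fill
print(f"{np.nanmean(miss_data):.2f}")    # 129.65, the NaN-safe mean agrees
```

The mask makes the denominator explicit: we are averaging over 10 months, not 12.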

Break Time

I think that’s it for this one. See you next time. Sorry, lots more NumPy to come.

Feel free to download my notebook covering the above and play around.

Resources