Aggregate and Statistical Functions
One of the first things most people do when looking at data is generate some basic summary statistics. The mean and standard deviation are the ones we commonly think of, but it is also reasonable, and sometimes helpful, to look at things like the minimum, maximum, median, quantiles and the like.
NumPy has a variety of aggregation functions to help us along. Let’s start with a couple of simple ones.
I will be using pandas to read a CSV file into my notebook, but for now that's all I will use it for. From the CSV data I will extract the specific data I am interested in (monthly rainfall) into a NumPy array. Note that the 2014 year is incomplete, so I will not be using it for most of the examples in this post. Do also note the NaN for the first 2 months of 2014.
import numpy as np
import pandas as pd
rain = pd.read_csv('data/rainGauge.csv')
print(rain)
# 2014 incomplete so only use the remaining years
yrs_use = ['2015', '2016', '2017', '2018', '2019', '2020']
data = np.array(rain[yrs_use])
print('\n', data)
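If you don't have the CSV handy, here is a minimal sketch of a stand-in with the same layout: 12 monthly rows and one column per year. Everything in it is an assumption made for illustration: the names demo, use and demo_data are hypothetical, and the values are random numbers, not my rain gauge readings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: 12 monthly rows, one column per year.
# The values are random numbers, NOT real rain gauge readings.
rng = np.random.default_rng(0)
years = [str(y) for y in range(2014, 2021)]
values = rng.uniform(0, 300, size=(12, len(years))).round(1)
demo = pd.DataFrame(values, columns=years)
demo.loc[0:1, "2014"] = np.nan    # mimic the two missing months in 2014

# Keep only the complete years and pull them out as a NumPy array
use = [y for y in years if y != "2014"]
demo_data = demo[use].to_numpy()  # gives the same array as np.array(demo[use])

print(demo_data.shape)            # (12, 6): months x years
print(np.isnan(demo_data).any())  # False: no missing values in the years kept
```

The rows are months and the columns are years, which matters later when we pick an axis to aggregate along.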
Minimum, Maximum and Sum
Vector/1-Dimensional Array
Let’s begin by looking at just the first complete year (2015) of data.
np.set_printoptions(threshold=1000)
r_2015 = data[:, 0]
print(f"2015 monthly rainfall:\n{r_2015}")
# get total year's rainfall, compare with Python built-in sum() function
# but should always use NumPy function for speed, especially large data sets
r_tot = np.sum(r_2015)
print(f"\n2015 total rainfall: {r_tot} => {sum(r_2015)} (Python sum())")
# now the min and max
print(f"\n2015 min: {np.min(r_2015)}, 2015 max: {np.max(r_2015)}")
# in this case, we could also use the equivalent ndarray methods
print(f"2015 min: {r_2015.min()}, 2015 max: {r_2015.max()}")
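The comment above about speed is easy to check for yourself. The following is a rough sketch only: the array big is hypothetical random data, and the timings will vary by machine, so I am not quoting any numbers.

```python
import timeit
import numpy as np

# Hypothetical data: 100,000 random values (not rainfall data)
rng = np.random.default_rng(42)
big = rng.random(100_000)

# Both approaches give (very nearly) the same total; tiny floating point
# differences are possible because the additions happen in a different order
py_total = sum(big)
np_total = np.sum(big)
print(np.isclose(py_total, np_total))

# Time 5 runs of each; actual values depend on your machine
t_py = timeit.timeit(lambda: sum(big), number=5)
t_np = timeit.timeit(lambda: np.sum(big), number=5)
print(f"built-in sum(): {t_py:.4f}s, np.sum(): {t_np:.4f}s")
```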
Multidimensional Array
Well, 2-D in our case. NumPy’s aggregate functions work just fine with multidimensional arrays. So, let’s have a shot at the full 72 months.
print("72 Months Inclusive:")
print(f"\tmin: {data.min()}, max: {data.max()}")
# how about the total for the 72 months
print(f"\ttotal: {data.sum():.2f}")
Multidimensional Array Along a Specific Axis
But we can also have NumPy do this by the years or months in the data array. We just need to ask NumPy to collapse the data along a specific axis. In our case the rows are the months and the columns are the years, so to get the min for each year we need to collapse the rows, which means aggregating along axis 0. To get the values by month, we aggregate along axis 1.
print("By year:")
print(f"\tmin: {data.min(axis=0)}\n\tmax: {data.max(axis=0)}\n\ttot:{data.sum(axis=0)}")
print("By month:")
print(f"\tmin: {data.min(axis=1)}\n\tmax: {data.max(axis=1)}\n\ttot:{data.sum(axis=1)}")
# though the total for each month across the years is likely not of much real value
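The axis argument is easy to get backwards, so here is a small sketch on a toy array where the answers can be checked by eye. The array toy is hypothetical: 3 "months" (rows) by 2 "years" (columns).

```python
import numpy as np

# Hypothetical toy data: 3 months (rows) x 2 years (columns)
toy = np.array([[1.0, 4.0],
                [2.0, 5.0],
                [3.0, 6.0]])

# axis=0 collapses the rows: one result per column (per year)
per_year = toy.sum(axis=0)    # totals of 6 and 15

# axis=1 collapses the columns: one result per row (per month)
per_month = toy.sum(axis=1)   # totals of 5, 7 and 9

print(per_year.shape, per_month.shape)   # (2,) (3,)
```

The axis you name is the one that disappears from the shape of the result.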
Some of the Available Functions
Do note that in data science we must be aware of missing data and its consequences. NumPy provides NaN-safe counterparts to many of its functions (check the documentation for which ones are available for your version of NumPy). For these to work, missing values need to be identified by the special value NaN.
Function | NaN-safe | Description |
---|---|---|
np.min | np.nanmin | Minimum value |
np.max | np.nanmax | Maximum value |
np.sum | np.nansum | Sum of all elements along specified axis/axes |
np.mean | np.nanmean | Mean of elements along specified axis/axes |
np.std | np.nanstd | Standard deviation |
np.var | np.nanvar | Variance |
np.median | np.nanmedian | Median of specified elements |
np.percentile | np.nanpercentile | Rank-based statistics of specified elements |
np.any | n/a | Are any elements true |
np.all | n/a | Are all elements true |
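To show the difference between the two columns of the table, here is a minimal sketch using a made-up array with one missing value. The names vals, plain_mean, safe_mean, safe_sum and n_missing are just for illustration.

```python
import numpy as np

# Hypothetical values with one missing entry
vals = np.array([np.nan, 2.0, 4.0, 6.0])

plain_mean = np.mean(vals)       # NaN: a single NaN poisons the result
safe_mean = np.nanmean(vals)     # 4.0: mean of the 3 values that are present
safe_sum = np.nansum(vals)       # 12.0: NaN is treated as zero for the sum

# np.any combined with np.isnan is a quick check for missing data
has_missing = np.any(np.isnan(vals))
n_missing = int(np.isnan(vals).sum())

print(plain_mean, safe_mean, safe_sum, has_missing, n_missing)
```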
Example: Rainfall Statistics
The whole point of these functions is to allow us to summarize our data set. So, let’s have a look.
# Let's start with the mean, standard deviation and such by month
np.set_printoptions(threshold=6, edgeitems=4)
print(f"Mean: {data.mean(axis=1)}")
print(f"Std Dev: {data.std(axis=1)}")
print(f"Min: {data.min(axis=1)}")
print(f"Max: {data.max(axis=1)}")
print(f"Median: {np.median(data, axis=1)}")
print(f"25th Percentile: {np.percentile(data, 25, axis=1)}")
print(f"75th Percentile: {np.percentile(data, 75, axis=1)}")
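If percentiles are new to you, a tiny sketch may help. The sample array is hypothetical, and I am relying on NumPy's default (linear interpolation) percentile method.

```python
import numpy as np

# Hypothetical sample, already sorted to make the results easy to verify
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Several percentiles can be requested in one call
q25, q50, q75 = np.percentile(sample, [25, 50, 75])
iqr = q75 - q25                    # interquartile range

print(q25, q50, q75)               # 2.0 3.0 4.0
print(q50 == np.median(sample))    # True: the 50th percentile is the median
print(iqr)                         # 2.0
```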
How about the average of the yearly totals? I am hoping I can string together a couple of array methods and get this done on one line. What do you think?
print(f"Mean annual total: {data.sum(axis=0).mean():.2f}")
Which matches my calculation using the annual totals we generated earlier, see [5].
Of course, plotting the data, as previously discussed, is also important in our initial analysis and data summarization. We'll get into that when we discuss matplotlib.
NaN vs Zero
Before we call it a day, perhaps a quick look at NaN using the data for 2014.
np.set_printoptions(threshold=1000)
miss_data = np.array(rain["2014"])
print(miss_data)
print("\n2014 data (missing two months as NaN):")
print(f"\tsum (numpy, NaN): {miss_data.sum():.2f}")
print(f"\tmean 2014 (numpy, NaN): {miss_data.mean():.2f}")
Now let's replace both NaN values with 0 and see what happens.
miss_data[0:2] = 0
print(miss_data)
print("\n2014 data (missing two months as 0):")
print(f"\tsum (numpy, zeroes): {miss_data.sum():.2f}")
# and the mean
print(f"\tmean 2014 (numpy, zeroes): {miss_data.mean():.2f}")
# which is equivalent to
print(f"\tmean 2014 (calc, sum/12): {(miss_data.sum() / 12):.2f}")
I expect which one to use will depend a lot on the situation. But in general I expect NaN will be the better choice, as it lets us know when we are missing data, which could be rather important for the calculation in question.
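There is also a middle road: keep the NaN values in the array and use the NaN-safe functions from the table above. A minimal sketch, using a made-up year with two missing months and 10 units of rain in each of the others (not my 2014 data):

```python
import numpy as np

# Hypothetical year: two missing months, then 10.0 for each remaining month
year = np.array([np.nan, np.nan] + [10.0] * 10)

# Zero-filled copy, leaving the original array untouched
zero_filled = np.where(np.isnan(year), 0.0, year)

nan_mean = np.nanmean(year)       # 100 / 10 = 10.0, averages the 10 known months
zero_mean = zero_filled.mean()    # 100 / 12, roughly 8.33, the zeros drag it down
n_missing = int(np.isnan(year).sum())

print(f"nanmean: {nan_mean:.2f}, zero-filled mean: {zero_mean:.2f}, missing: {n_missing}")
```

Zero-filling quietly claims those months were dry, while the NaN-safe mean only describes the months we actually measured.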
Break Time
I think that's it for this one. See you next time. Sorry, lots more NumPy to come.
Feel free to download my notebook covering the above and play around.
Resources
- numpy.set_printoptions
- NumPy Sums, products, differences
- NumPy Statistics