As near as I can make out, pandas is truly a beast. I am a little uncertain where to go next. But somewhere I must go.
pandas inherits and adds to the NumPy element-wise operations (e.g. addition, multiplication) and the ufuncs we looked at previously. It also adds benefits of its own, such as preserving index and column labels and aligning indices between operands.
The handling of missing data is a critical aspect of the data science process. Pandas does offer some assistance there as well. Then there is the concept of hierarchical indexing. And, we will at times need to combine datasets and perform some basic analysis (aggregation, grouping, pivot tables). Not sure which of those I will cover and which I may leave for when needed in later exercises.
But we should definitely have a look at what pandas does to help with processing time series.
Let’s start with a quick look at using NumPy’s ufuncs with pandas objects. And some of the consequences.
import numpy as np
import pandas as pd
# seed random number generator for reproducibility
rng = np.random.default_rng(seed=42)
s1 = pd.Series(rng.integers(0, 16, 6))
df1 = pd.DataFrame(rng.integers(0, 16, (3, 4)), columns=['eenie', 'meenie', 'miney', 'mo'])
print(s1)
print('\n', df1)
# let's try some ufuncs
s2 = np.exp2(s1)
print(s2)
# np.cos expects radians, let's divide our values by some arbitrary number, say 6, and then multiply by pi
df2 = np.cos(df1 * np.pi / 6)
print('\n', df2)
# note the change in data type, but also that indices are preserved
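The label preservation is a little easier to see with a non-default index. A minimal sketch, using a made-up Series with string labels (the names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical Series with string labels instead of the default 0..n-1 index
s_lbl = pd.Series([4, 9, 16], index=['a', 'b', 'c'])

# the ufunc is applied element-wise; the labels come along for the ride
s_root = np.sqrt(s_lbl)
print(s_root)

# the integer input becomes float64, the index is unchanged
print(s_root.dtype, list(s_root.index))
```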
You will have noted that the generated objects retain the appropriate indices from the root operand(s). Now let’s do some arithmetic with two objects.
For a Series, pandas will keep the indexes of both objects (well, their union). And it will assign NaN (Not a Number) to those entries where the calculation could not be performed because a suitable value was missing in one of the Series. This is pandas’ default convention for handling missing data.
# okay now let's look at what happens when operating on two objects
# 7 largest countries by area in sq km and 7 largest by population in 2021
area = {'Russia': 17098242, 'Canada': 9984670, 'China': 9706961,
'United States': 9372610, 'Brazil': 8515767, 'Australia': 7692024,
'India': 3287590}
pop = {'China': 1439323776, 'India': 1380004385, 'United States': 331002651,
'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417,
'Nigeria': 206139589}
s_area = pd.Series(area, name='area')
s_pop = pd.Series(pop, name='population')
s_density = s_pop / s_area
print(s_density)
You will note that pandas aligns the indices correctly, regardless of their order in the two Series. And, it also sorts the index for the resulting Series.
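A smaller sketch of the same behaviour, with made-up labels and values so the arithmetic can be checked by hand. The point is that values are matched by label, not by position:

```python
import pandas as pd

# hypothetical data: same idea as area/population, just tiny
a = pd.Series({'z': 1, 'a': 2, 'm': 3})
b = pd.Series({'m': 10, 'q': 20, 'a': 30})

total = a + b
print(total)

# 'a' and 'm' exist in both Series, so they get real sums (2 + 30, 3 + 10)
# 'q' and 'z' exist in only one Series, so they come back as NaN
print(total['a'], total['m'], total.isna().sum())
```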
In some special cases, you may not want to have missing values, e.g. NaN, in the result. If appropriate, you can specify a fill value to be used for any missing values in either operand when the calculation is performed. I don’t at present have a suitable example, and the following is likely not a suitable example. Doubt those zeroes and infinities are of much value. It is simply a demonstration.
# if you don't want to see NaNs in the result, you can specify a fill value to be used for any missing values
s_den_2 = s_pop.divide(s_area, fill_value=0)
print(s_den_2)
Now how about computations with DataFrames.
# let's get two DataFrames with dice rolls, for a set of two dice and a set of 3 dice
rng = np.random.default_rng(seed=42)
df_2d = pd.DataFrame(rng.integers(1, 7, (2, 2)), columns=['blue', 'red'])
df_3d = pd.DataFrame(rng.integers(1, 7, (3, 3)), columns=['green', 'red', 'blue'])
print(df_2d)
print()
print(df_3d)
df_2s2 = df_2d + df_3d.iloc[:2, 1:]
print()
print(df_2s2)
df_2s3 = df_2d + df_3d.iloc[:2, :]
print()
print(df_2s3)
df_sum = df_2d + df_3d
print()
print(df_sum)
Once again, the indices from both operands are preserved and properly aligned, the calculation is carried out where possible, and NaN is assigned wherever the operation is missing an operand. The output indices are sorted appropriately.
And, we can use a fill value here as well.
# let's try that with a fill value
# how about the median value of the 3 dice roll data
# we need to 'reshape' df_3d into a series first as median() operates on a column or row
rs_3d = df_3d.stack()
# here's what it now looks like
print(rs_3d, '\n')
f_val = rs_3d.median()
print(f"f_val = {f_val}\n")
df_sum = df_2d.add(df_3d, fill_value=f_val)
print(df_sum)
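One thing worth knowing: as I understand the pandas documentation, the fill value is only used where exactly one of the two operands is missing. If a cell is missing in both DataFrames, the result is still NaN. A small sketch with two made-up frames that overlap in neither columns nor all rows:

```python
import pandas as pd

# hypothetical frames: 'left' has column a on rows 0-1, 'right' has column b on rows 1-2
left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [10, 20]}, index=[1, 2])

res = left.add(right, fill_value=0)
print(res)

# (0, 'a'): only 'left' has a value, so 1 + fill value 0
# (1, 'b'): only 'right' has a value, so 0 + 10
# (0, 'b') and (2, 'a'): missing in both frames, so they stay NaN
```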
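And how about operations between a DataFrame and a Series? Let’s take a quick look. As pandas is built on NumPy, NumPy’s default of row-wise broadcasting applies in this case as well: the Series index is matched against the DataFrame’s columns and the operation is applied to every row. But that can be overridden using the axis argument of the object methods tied to the normal Python operators (e.g. subtract() for -); example included.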
# a new DataFrame
rng = np.random.default_rng(seed=42)
df_4d = pd.DataFrame(rng.integers(1, 7, (3, 4)), columns=['red', 'green', 'blue', 'white'])
print(df_4d, '\n')
print(df_4d - df_4d.iloc[0], '\n')
# row-wise operation is the default, but...
df_sub = df_4d.subtract(df_4d['red'], axis=0)
print(df_sub, '\n')
# and that aligning of indices works here as well
s_part = df_4d.iloc[0, 1::2]
print(s_part, '\n')
print(df_4d - s_part)
And as expected, index alignment carries through.
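Since the random dice values make it hard to check results by eye, here is the same row-wise versus column-wise comparison with small fixed numbers. The frame is invented purely for illustration:

```python
import pandas as pd

# hypothetical frame with easy-to-follow values
df_xy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# default: the first row (x=1, y=10) is subtracted from every row
by_row = df_xy - df_xy.iloc[0]
print(by_row, '\n')

# axis=0: column 'x' is subtracted from every column, matching on the row index
by_col = df_xy.sub(df_xy['x'], axis=0)
print(by_col)
```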
Done, m’thinks
You know, that took a bit more time than I expected. So going to close off this post. Feel free to download my notebook covering the above and play around.
Resources
- Stop using numpy.random.seed()
- Legacy Random Generation
- Random Generator
- pandas.DataFrame.stack
- pandas.DataFrame.median
- Most Populous Countries in the World (2021)
- Largest Countries in the World (by area)