As near as I can make out, pandas is truly a beast. I am a little uncertain where to go next. But somewhere I must go.

pandas inherits NumPy's element-wise operations (e.g. addition, multiplication) and the ufuncs we looked at previously, and it adds benefits of its own: index and column labels are preserved, and operands are aligned on their indices.

The handling of missing data is a critical aspect of the data science process. Pandas does offer some assistance there as well. Then there is the concept of hierarchical indexing. And, we will at times need to combine datasets and perform some basic analysis (aggregation, grouping, pivot tables). Not sure which of those I will cover and which I may leave for when needed in later exercises.

But we should definitely have a look at what pandas does to help with processing time series.

Let’s start with a quick look at using NumPy’s ufuncs with pandas objects, and at some of the consequences.

In [1]:

import numpy as np
import pandas as pd

In [2]:

# seed random number generator for reproducibility
rng = np.random.default_rng(seed=42)
s1 = pd.Series(rng.integers(0, 16, 6))
df1 = pd.DataFrame(rng.integers(0, 16, (3, 4)), columns=['eenie', 'meenie', 'miney', 'mo'])
print(s1)
print('\n', df1)

0     1
1    12
2    10
3     7
4     6
5    13
dtype: int64

   eenie  meenie  miney  mo
0      1      11      3   1
1      8      15     11  12
2     11      12      8   2

In [3]:

# let's try some ufuncs
s2 = np.exp2(s1)
print(s2)
# np.cos expects radians, let's divide our values by some arbitrary number, say 6, and then multiply by pi
df2 = np.cos(df1 * np.pi / 6)
print('\n', df2)
# note the change in data type, but also that indices are preserved

0       2.0
1    4096.0
2    1024.0
3     128.0
4      64.0
5    8192.0
dtype: float64

      eenie        meenie         miney        mo
0  0.866025  8.660254e-01  6.123234e-17  0.866025
1 -0.500000  1.194340e-15  8.660254e-01  1.000000
2  0.866025  1.000000e+00 -5.000000e-01  0.500000
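
The default integer index makes label preservation easy to overlook. Here is a minimal sketch with string labels; the name `temps_c` and the city labels are made up purely for illustration. It suggests the labels ride along through a ufunc untouched.

```python
import numpy as np
import pandas as pd

# hypothetical data: values and labels chosen only for illustration
temps_c = pd.Series([0.0, 10.0, 20.0], index=['oslo', 'paris', 'cairo'])

# apply a NumPy ufunc directly to the Series
temps_sq = np.square(temps_c)

print(temps_sq)
# the result is still a Series and carries the same index labels
print(type(temps_sq).__name__, list(temps_sq.index))
```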

You will have noted that the generated objects retain the appropriate indices from the root operand(s). Now let’s do some arithmetic with two objects.

When operating on two Series, pandas keeps the index labels of both objects (well, their union). It assigns NaN (Not a Number) to any label where the calculation could not be performed because a value was missing in one of the Series. This is pandas' default convention for handling missing data.

In [4]:

# okay now let's look at what happens when operating on two objects
# 7 largest countries by area in sq km and 7 largest by population in 2021
area = {'Russia': 17098242, 'Canada': 9984670, 'China': 9706961,
        'United States': 9372610, 'Brazil': 8515767, 'Australia': 7692024,
        'India': 3287590}
pop = {'China': 1439323776, 'India': 1380004385, 'United States': 331002651,
       'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417,
       'Nigeria': 206139589}
s_area = pd.Series(area, name='area')
s_pop = pd.Series(pop, name='population')
s_density = s_pop / s_area
print(s_density)

Australia               NaN
Brazil            24.960690
Canada                  NaN
China            148.277486
India            419.761705
Indonesia               NaN
Nigeria                 NaN
Pakistan                NaN
Russia                  NaN
United States     35.315953
dtype: float64

You will note that pandas aligns the indices correctly, regardless of their order in the two Series. And, it also sorts the index for the resulting Series.
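
To make the label matching easier to see than in the country data, here is a small sketch. The Series `left` and `right` and their single-letter labels are invented for the purpose. The labels are deliberately listed in different orders.

```python
import pandas as pd

# hypothetical Series with overlapping labels listed in different orders
left = pd.Series({'c': 30, 'a': 10, 'b': 20})
right = pd.Series({'b': 2, 'd': 4, 'a': 1})

total = left + right
print(total)

# values are matched by label, not by position
print(total['a'], total['b'])   # 11.0 22.0

# labels present in only one operand come out as NaN
print(total.isna())
```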

In some special cases, you may not want to have missing values, e.g. NaN, in the result. If appropriate, you can specify a fill value to be used for any missing values in either operand when the calculation is performed. I don’t at present have a suitable example, and the following is likely not a suitable example. Doubt those zeroes and infinities are of much value. It is simply a demonstration.

In [5]:

# if you don't want to see NaNs in the result, you can specify a fill value to be used for any missing values
s_den_2 = s_pop.divide(s_area, fill_value=0)
print(s_den_2)

Australia          0.000000
Brazil            24.960690
Canada             0.000000
China            148.277486
India            419.761705
Indonesia               inf
Nigeria                 inf
Pakistan                inf
Russia             0.000000
United States     35.315953
dtype: float64
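
A case where a fill value seems easier to defend is adding counts, where "missing" can reasonably be read as zero. The sketch below uses made-up unit sales for two hypothetical shops, `shop_a` and `shop_b`. Whether zero is the right stand-in is an assumption that depends on the data.

```python
import pandas as pd

# hypothetical unit sales per product at two shops
shop_a = pd.Series({'apples': 5, 'pears': 3})
shop_b = pd.Series({'pears': 4, 'plums': 7})

# plain addition leaves NaN for products sold in only one shop
print(shop_a + shop_b)

# treating a missing product as zero sales gives a usable total
total = shop_a.add(shop_b, fill_value=0)
print(total)
```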

Now how about computations with DataFrames?

In [6]:

# let's get two DataFrames with dice rolls, for a set of 2 dice and a set of 3 dice
rng = np.random.default_rng(seed=42)
df_2d = pd.DataFrame(rng.integers(1, 7, (2, 2)), columns=['blue', 'red'])
df_3d = pd.DataFrame(rng.integers(1, 7, (3, 3)), columns=['green', 'red', 'blue'])
print(df_2d)
print()
print(df_3d)
df_2s2 = df_2d + df_3d.iloc[:2, 1:]
print()
print(df_2s2)
df_2s3 = df_2d + df_3d.iloc[:2, :]
print()
print(df_2s3)
df_sum = df_2d + df_3d
print()
print(df_sum)

   blue  red
0     1    5
1     4    3

   green  red  blue
0      3    6     1
1      5    2     1
2      4    6     5

   blue  red
0     2   11
1     5    5

   blue  green  red
0     2    NaN   11
1     5    NaN    5

   blue  green   red
0   2.0    NaN  11.0
1   5.0    NaN   5.0
2   NaN    NaN   NaN

Once again, the indices from both operands are preserved and properly aligned. The calculation is carried out where possible, and NaN is assigned wherever an operand is missing. The output indices are sorted as well.
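
One way to look at the alignment step on its own is the `align()` method, which returns both operands reindexed to a common set of labels. The sketch below rebuilds the dice values by hand under the made-up names `left` and `right`. It is only meant to illustrate the idea, not to describe what pandas does internally.

```python
import pandas as pd

# small frames with partly overlapping rows and columns (values typed in by hand)
left = pd.DataFrame({'blue': [1, 4], 'red': [5, 3]})
right = pd.DataFrame({'green': [3, 5, 4], 'red': [6, 2, 6], 'blue': [1, 1, 5]})

# an outer join reindexes both frames to the union of their row and column labels
left_al, right_al = left.align(right, join='outer')
print(left_al, '\n')
print(right_al, '\n')

# the aligned frames have the same shape, so element-wise addition is straightforward
summed = left_al + right_al
print(summed)
```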

And, we can use a fill value here as well.

In [7]:

# let's try that with a fill value
# how about the median value of the 3 dice roll data
# we need to 'reshape' df_3d into a series first as median() operates on a column or row
rs_3d = df_3d.stack()
# here's what it now looks like
print(rs_3d, '\n')
f_val = rs_3d.median()
print(f"f_val = {f_val}\n")
df_sum = df_2d.add(df_3d, fill_value=f_val)
print(df_sum)

0  green    3
   red      6
   blue     1
1  green    5
   red      2
   blue     1
2  green    4
   red      6
   blue     5
dtype: int64

f_val = 4.0

   blue  green   red
0   2.0    7.0  11.0
1   5.0    9.0   5.0
2   9.0    8.0  10.0
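
One wrinkle worth knowing: the fill value only stands in for a value missing from one of the two operands. Where a cell is missing from both, the result stays NaN. A minimal sketch, using two made-up single-cell frames `df_a` and `df_b` that share no labels at all:

```python
import pandas as pd

# hypothetical frames that share neither a row label nor a column label
df_a = pd.DataFrame({'x': [1]}, index=[0])
df_b = pd.DataFrame({'y': [2]}, index=[1])

filled = df_a.add(df_b, fill_value=0)
print(filled)

# (0, 'x') and (1, 'y') exist in one operand, so the fill value is used there;
# (0, 'y') and (1, 'x') exist in neither operand, so they remain NaN
```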

And how about operations between a DataFrame and a Series? Let’s take a quick look. By default, the Series index is matched against the DataFrame’s column labels and the operation is repeated for every row, much like NumPy broadcasting a one-dimensional array against a two-dimensional one. That default can be overridden with the axis argument of the object methods that correspond to the normal Python operators; example included.

In [8]:

# a new DataFrame
rng = np.random.default_rng(seed=42)
df_4d = pd.DataFrame(rng.integers(1, 7, (3, 4)), columns=['red', 'green', 'blue', 'white'])
print(df_4d, '\n')
print(df_4d - df_4d.iloc[0], '\n')
# row-wise operation is the default, but...
df_sub = df_4d.subtract(df_4d['red'], axis=0)
print(df_sub, '\n')
# and that aligning of indices works here as well
s_part = df_4d.iloc[0, 1::2]
print(s_part, '\n')
print(df_4d - s_part)

   red  green  blue  white
0    1      5     4      3
1    3      6     1      5
2    2      1     4      6

   red  green  blue  white
0    0      0     0      0
1    2      1    -3      2
2    1     -4     0      3

   red  green  blue  white
0    0      4     3      2
1    0      3    -2      2
2    0     -1     2      4

green    5
white    3
Name: 0, dtype: int64

   blue  green  red  white
0   NaN    0.0  NaN    0.0
1   NaN    1.0  NaN    2.0
2   NaN   -4.0  NaN    3.0

And as expected, index alignment carries through.
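
A slightly more practical use of the axis argument is centring data: subtracting each column's mean versus each row's mean. The frame `rolls` below is made up for illustration, and this is only a sketch of the two directions.

```python
import pandas as pd

# hypothetical dice rolls, three throws of three dice
rolls = pd.DataFrame({'red': [1, 3, 2], 'green': [5, 6, 1], 'blue': [4, 1, 4]})

# default: the Series index is matched against the column labels,
# so each column has its own mean subtracted
by_column = rolls - rolls.mean()

# axis=0: the Series index is matched against the row labels,
# so each row has its own mean subtracted
by_row = rolls.sub(rolls.mean(axis=1), axis=0)

print(by_column, '\n')
print(by_row)
```

After centring, the column means of `by_column` and the row means of `by_row` should both come out at (approximately) zero.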

Done, m’thinks

You know, that took a bit more time than I expected. So going to close off this post. Feel free to download my notebook covering the above and play around.

Resources

Stop using numpy.random.seed()
Legacy Random Generation
Random Generator
pandas.DataFrame.stack
pandas.DataFrame.median
Most Populous Countries in the World (2021)
Largest Countries in the World (by area)

Too Old To Code

Data Science Basics: pandas — Operating on Pandas Object
