Next on our list of packages to investigate is pandas.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

pandas home page

pandas is built on top of NumPy. Most data is table-like in structure, and I expect it seldom has three or more dimensions. pandas gives us the ability to work with data in tabular format (think spreadsheet, database table, etc.). As in a spreadsheet, we have column headings/labels which can be used to access data, along with other forms of indexing. It provides two fundamental data structures for us to work with, Series and DataFrame, though there are others. The code is optimized for performance, with plenty of the critical bits written in Cython or C.

Though we can (and will in this post) create DataFrames directly from other data structures, it is more likely that the in-memory pandas data structures will be obtained from another source (CSV or text files, an Excel spreadsheet, a SQL database, etc). pandas provides a variety of tools to help us do just that. Once the data is in memory, pandas helps us wrangle the data, including integrated handling of missing data.
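
As a rough illustration of loading data from an external source, here is a minimal sketch that reads CSV text with pd.read_csv(). To keep it self-contained, the "file" is an in-memory string wrapped in io.StringIO; the column names and values are made up for the example and are not from any real data set.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk;
# note the empty score field in the row for "b"
csv_text = "name,score\na,1.5\nb,\nc,3.0\n"

df_scores = pd.read_csv(io.StringIO(csv_text))
print(df_scores)

# The empty field is read in as NaN (missing); isna() flags it
print(df_scores["score"].isna().tolist())

# One possible way to handle it: drop the rows that have missing values
print(df_scores.dropna())
```

With a real file you would pass its path to pd.read_csv() instead of the StringIO object.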

Wrangling data is a key step in all data science projects. This can involve reshaping of the data, merging data from different sources, selecting a sub-set of the original data, dealing with missing or outlier data, etc.

So, let’s start by creating some simple pandas data structures.

If you followed my environment set-up post, you already have pandas installed. If not, please install it now.

Series

I don’t know how often one would work directly with the Series data structure. But it is a good place to start the discussion.

Series is a one-dimensional, array-like structure with labels, similar in concept to a dictionary. It can hold any data type. The labels, as a group, are referred to as the index. A core element of pandas design is that the link between the labels and the data will not be broken unless you explicitly do so. This is true for the DataFrame as well.
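
To make the point about labels staying attached to their data a little more concrete, here is a small sketch. The values and labels are purely illustrative.

```python
import pandas as pd

# Illustrative data, not from the post
s = pd.Series([30, 10, 20], index=["a", "b", "c"])

# Sorting by value reorders the rows, but each value keeps its label
s_sorted = s.sort_values()
print(s_sorted)
print(s_sorted.index.tolist())

# Label-based lookup returns the same value before and after sorting
print(s["b"], s_sorted["b"])
```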

We can create a Series using the pandas Series constructor, pd.Series(). It accepts a fair number of data objects from which to create the Series, including:

  • Python dictionary
  • NumPy ndarray
  • a scalar value

There is also an optional index parameter which can be used to specify the labels for the data.

Let’s have a look.

As usual we will begin by importing our required packages. For this one NumPy and pandas.

In [1]:
import numpy as np
import pandas as pd
In [2]:
# let's make sure pandas is installed and imported
pd.__version__
Out[2]:
'1.2.3'

Now, let’s use the pandas.Series() constructor to create some pandas Series data structures. It takes a set of data and, optionally, an index argument. We’ll consider the three types of data mentioned above. Let’s start with NumPy ndarrays.

In [3]:
# no index specified
s1 = pd.Series(np.random.randn(5))
print(s1)
print(s1.index)
# let's add an index
print()
s2 = pd.Series(np.random.randn(5), index=['i', 'j', 'k', 'm', 'n'])
print(s2)
print(s2.index)
0   -0.376533
1   -0.524440
2    1.366831
3    0.729068
4   -1.077042
dtype: float64
RangeIndex(start=0, stop=5, step=1)

i   -1.124408
j   -0.735771
k   -0.686441
m   -0.166480
n    0.833207
dtype: float64
Index(['i', 'j', 'k', 'm', 'n'], dtype='object')

And now let’s look at using a Python dictionary as our data source.

In [4]:
# let's use a dictionary
print()
d_siblings = {'Kristine': 'sister', 'Taffy': 'sister-in-law', 'Arthur': 'brother', 'Stuart': 'brother-in-law'}
s3 = pd.Series(d_siblings)
print(s3)
# how the index is arranged/sorted can vary with version of Python and pandas
print(s3.index)
Kristine            sister
Taffy        sister-in-law
Arthur             brother
Stuart      brother-in-law
dtype: object
Index(['Kristine', 'Taffy', 'Arthur', 'Stuart'], dtype='object')

And, finally a scalar value as our data source. In this case an index argument is required.

In [5]:
# If using a scalar value, an index must be provided. The value will be repeated for each index entry.
print()
s4 = pd.Series("99", index=['i', 'j', 'k', 'm', 'n'])
print(s4)
print(s4.index)
i    99
j    99
k    99
m    99
n    99
dtype: object
Index(['i', 'j', 'k', 'm', 'n'], dtype='object')

Since a Series is array-like and, for most data types, backed by a NumPy ndarray, it is a valid argument for most NumPy functions. When the result is itself a vector, it generally comes back as a Series with the appropriate index values still attached.
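
A quick sketch of passing a Series to NumPy functions, using illustrative values chosen so the results are easy to check by eye:

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the post
s = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])

# An element-wise NumPy function returns a Series with the same labels
roots = np.sqrt(s)
print(roots)
print(type(roots).__name__)

# A reduction such as np.sum collapses the Series to a single number
print(np.sum(s))
```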

Let’s have a look at accessing the Series elements. Either to read data or to change it.

In [6]:

# positional access on a labelled Series; newer pandas releases prefer s2.iloc[0] for this
print(f"s2[0] = {s2[0]}")
print(f"\ns2[0:3] =\n{s2[0:3]}")
print(f"\ns2[s2 > s2.mean()] =\n{s2[s2 > s2.mean()]}")
s2[0] = -1.1244079757337493

s2[0:3] =
i   -1.124408
j   -0.735771
k   -0.686441
dtype: float64

s2[s2 > s2.mean()] =
m   -0.166480
n    0.833207
dtype: float64

In [7]:
# like dictionaries we can use labels to access values/slices
print(f's2["n"] = {s2["n"]}')
s2["j"] = 99
print(f'\ns2 =\n{s2}')
print(f'\n"m" in s2 = {"m" in s2}')
print(f'\n"e" in s2 = {"e" in s2}')
s2["n"] = 0.8332070816926587

s2 =
i    -1.124408
j    99.000000
k    -0.686441
m    -0.166480
n     0.833207
dtype: float64

"m" in s2 = True

"e" in s2 = False

Operations involving two Series will return a result based on the union of their indexes. If a label is missing from either index, the result for that label will be marked as missing (NaN).

In [8]:

s5 = pd.Series(np.random.randn(5), index=['j', 'k', 'm', 'n', 'p'])
print(f"s2 + s5 = \n{s2 + s5}")
s2 + s5 = 
i          NaN
j    97.809103
k    -0.597248
m    -0.556047
n     1.384724
p          NaN
dtype: float64

There’s plenty more but that should do as an introduction.

DataFrame

The documentation describes a DataFrame as follows:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

pandas: Intro to data structures

We will follow more or less the pattern we used above for the pandas Series data structure. To start, we will look at creating DataFrames from a variety of other data sources.

And, I do believe, each column of a DataFrame is in fact a Series: selecting a single column returns a Series object. So let’s start the discussion using a dictionary of Series to create a DataFrame.

Note: the display of the output tables below does not match their display in the notebook.

In [10]:
# a dict of Series
siblings = {
    "Kristine": pd.Series(['Sister', ['Cathy', 'Kevin'], 'Rick'], index=['Relationship', 'Children', 'Spouse']),
    "Arthur": pd.Series(['Brother', ['Amanda', 'Melanie'], 'Myra', ['Soma']], index=['Relationship', 'Children', 'Spouse', 'Pets']),
    "Eileen": pd.Series(['Sister-in-law', 'Laurence'], index=['Relationship', 'Spouse']),
    "Stuart": pd.Series(['Brother-in-law', ['William', 'Matthew'], 'Margaret', ['Boone']], index=['Relationship', 'Children', 'Spouse', 'Pets'])
}
# dataframes often given variable name of df
# index will be the union of the indexes of the Series in dict
df = pd.DataFrame(siblings)
# display(df)
# do note the NaNs indicating missing data once the indexes are combined
df.head(5)
Out[10]:
                    Kristine             Arthur         Eileen              Stuart
Children      [Cathy, Kevin]  [Amanda, Melanie]            NaN  [William, Matthew]
Pets                     NaN             [Soma]            NaN             [Boone]
Relationship          Sister            Brother  Sister-in-law      Brother-in-law
Spouse                  Rick               Myra       Laurence            Margaret
In [11]:
# let's access the children row (more than 1 way) and then some columns
print(df.iloc[0])
display(df.loc['Children', :])
display(df.head(1))
display(df.loc[:, 'Kristine':'Arthur'])
Kristine        [Cathy, Kevin]
Arthur       [Amanda, Melanie]
Eileen                     NaN
Stuart      [William, Matthew]
Name: Children, dtype: object
Kristine        [Cathy, Kevin]
Arthur       [Amanda, Melanie]
Eileen                     NaN
Stuart      [William, Matthew]
Name: Children, dtype: object

                Kristine             Arthur Eileen              Stuart
Children  [Cathy, Kevin]  [Amanda, Melanie]    NaN  [William, Matthew]

                    Kristine             Arthur
Children      [Cathy, Kevin]  [Amanda, Melanie]
Pets                     NaN             [Soma]
Relationship          Sister            Brother
Spouse                  Rick               Myra
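
The earlier remark that DataFrame columns are Series can be checked with a short sketch. The DataFrame here is a made-up, two-column example rather than the siblings data.

```python
import pandas as pd

# Illustrative data, not from the post
df_small = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# Selecting one column with [] returns a Series sharing the DataFrame's index
col = df_small["x"]
print(type(col).__name__)
print(col.index.equals(df_small.index))

# Selecting one row with .loc also returns a Series, labelled by the column names
row = df_small.loc[0]
print(row.index.tolist())
```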

I decided not to go over all the DataFrame creation options. There is lots of searchable info on the web. But here is one more example, using a two-dimensional list of lists; a NumPy 2-D array can be passed to the constructor in the same way. You will need to add a suitable import to your notebook: from math import cos, sin, tan, pi. I added mine with the others in the first code cell at the top of the notebook.

In [12]:
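# needs: from math import cos, sin, tan, pi (added to the first code cell)
trig = []    # one row per angle: [radians, cos, sin, tan]
angles = {}  # degrees -> radians; the keys become the row labels
for i in range(0, 181, 45):
    rads = i * pi / 180
    angles[i] = rads
    trig.append([rads, round(cos(rads), 4), round(sin(rads), 4), round(tan(rads), 4)])
#print(angles, '\n\n', trig)
# note: tan(90 degrees) is undefined; floating point gives a very large number instead
pd_trig = pd.DataFrame(trig,
                       columns=['radians', 'cos', 'sin', 'tan'],
                       index=list(angles.keys())
                       )
display(pd_trig)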

      radians     cos     sin           tan
0    0.000000  1.0000  0.0000  0.000000e+00
45   0.785398  0.7071  0.7071  1.000000e+00
90   1.570796  0.0000  1.0000  1.633124e+16
135  2.356194 -0.7071  0.7071 -1.000000e+00
180  3.141593 -1.0000  0.0000 -0.000000e+00
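
The cell above builds its rows as a Python list of lists. For comparison, here is a sketch of roughly the same table built from an actual NumPy 2-D array. The variable names are my own, and rounding is applied only when printing, so the stored values differ slightly from those above.

```python
import numpy as np
import pandas as pd

# Same angles as the cell above, as a NumPy array of degrees
degrees = np.arange(0, 181, 45)
rads = np.deg2rad(degrees)

# Stack the four columns side by side into one 2-D array of shape (5, 4)
data = np.column_stack([rads, np.cos(rads), np.sin(rads), np.tan(rads)])

np_trig = pd.DataFrame(data, columns=["radians", "cos", "sin", "tan"], index=degrees)

# Round for display only; the DataFrame itself keeps full precision
print(np_trig.round(4))
```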

There are also a number of valuable bits regarding the Index object. But it boils down to two facts: the Index is immutable, and it can be treated as an ordered set, which allows us to use set operations on Index objects (e.g. union, intersection, difference). I have decided not to get into operations on the Index object at this time.

Time for a Break

That’s it for this one. Next time we will take a slightly deeper look at accessing the contents of Series and DataFrame objects. Feel free to download my notebook covering the above and play around.

Resources