With Python lists and NumPy arrays we have the means to index and slice the objects. With NumPy arrays we also have masking and fancy indexing (I haven’t yet dug into the last two). Only makes sense we can do similar things with pandas Series
and DataFrame
objects. So let’s have a look.
But it may be beneficial to pay heed to the following quote from the pandas documentation:
The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommended that you take advantage of the optimized pandas data access methods exposed in this chapter.
Indexing and selecting data
Indexing
Pandas provides two basic indexer attributes for indexing into it’s Series and Dataframe objects: .iloc
and .loc
. These are not object methods as such, rather attributes that provide a specific slicing interface to the pandas objects. The first, .iloc
is primarily integer position based — like a NumPy array. .loc
is primarily label based — like a dictionary. For a Series this follows the pattern .loc[indexer]
and for a DataFrame .loc[row_indexer, column_indexer]
. Ditto for .iloc
. The null slice can be used for any axis indexer. If an axis specifier is left out it is assumed to be :
. So, p.loc['a']
is equivalent to p.loc['a', :]
.
Let’s first have a look at indexing with []
. It is generally used to get lower dimensional slices. So,
series[label]
will return a scalardataframe[column_name]
will return a Series corresponding to the selected column
We can use []
to slice ranges from the objects.
series[label1:label2{:interval}]
will return a 1-D slicedataframe[row_indexer1:row_indexer2{:interval}]
will provide a 2-D slice with the specified rows
import numpy as np
import pandas as pd
# let's create some objects
pop_src = {
'Australia': {'2015': 23932.498999999993, '2016': 24262.71, '2017': 24584.619, '2018': 24898.153000000002, '2019': 25203.199999999997, '2020': 25499.881},
'Brazil': {'2015': 204471.759, '2016': 206163.05599999995, '2017': 207833.825, '2018': 209469.31999999995, '2019': 211049.51900000003, '2020': 212559.40899999999},
'China': {'2015': 1406847.8679999998, '2016': 1414049.353, '2017': 1421021.7940000002, '2018': 1427647.7889999999, '2019': 1433783.6919999998, '2020': 1439323.7739999997},
'Japan': {'2015': 127985.13900000001, '2016': 127763.267, '2017': 127502.72799999999, '2018': 127202.19, '2019': 126860.29899999998, '2020': 126476.458},
'Russia': {'2015': 144985.059, '2016': 145275.37400000004, '2017': 145530.09099999996, '2018': 145734.03399999999, '2019': 145872.26, '2020': 145934.46},
'USA': {'2015': 320878.3120000001, '2016': 323015.99199999997, '2017': 325084.75800000003, '2018': 327096.263, '2019': 329064.917, '2020': 331002.64699999994}
}
pop_china = pd.Series(pop_src['China'])
print(pop_china)
print()
pops = pd.DataFrame(pop_src)
display(pops)
print(f"pop China 2015: {pop_china['2015']}")
print(f"pop China {pop_china.index[0]}: {pop_china[0]}")
Do note the difference in the output of the next two slices. When using index labels pandas includes the start and end index. When using their equivalent integer positions, that does not happen — as is the case with Python slices.
print(f"pop_china['2017':'2019']:\n")
print(pop_china['2017':'2019'])
print(f"\npop_china[2:4]:\n")
print(pop_china[2:4])
Now a quick look at DataFrames using [ ]
notation.
print(f"pops['Japan'] =>\n{pops['Japan']}")
Do note that a slice interval does not work in this case.
# a slice interval will not work
print(f"pops['Japan':'Russia'] =>\n{pops['Japan':'Russia']}")
# but passing a list of columns will work
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")
Assigning values also works.
# we can even use that list mechanism to swap columns
pops[['Japan', 'Russia']] = pops[['Russia', 'Japan', ]]
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")
Attribute access, in the sense of an object attribute, can also be used.
# now let's have a look at attribute access
# have to change the index to true strings for the following to work
s2 = pd.Series(pop_src['USA'].values(), index=list('abcdef'))
print(s2)
print()
df2 = pops.copy()
df2[['Japan', 'Russia']] = df2[['Russia', 'Japan', ]]
display(df2)
print(f"s2.d -> {s2.d}")
print(f"\ndf2.Brazil ->\n{df2.Brazil}")
# the reason for changing s2's index is that you can only use attribute access if the index element is a valid Python identifier
s3 = df2.Brazil.copy()
print(s3)
# 2019 not a valid identifier, but s3['2019'] will still work
print(s3.2019)
# and we can also assign values using this syntax
s2.d = 9.99
print(s2)
df2.Brazil = list(range(len(df2.index))) # ok if column Brazil already exists
display(df2)
df2['ZZ'] = list(range(len(df2.index))) # use this form to create a new column
display(df2)
Now, onto those pandas indexer attributes. (Post getting lengthy, so this may be less detailed than I originally planned.)
Do note that the slicing and indexing conventions can be a source of confusion. If you have an explicit integer index, then series[1]
will return the item with the index 1
. However the slice, series[1:3]
will return the slice identified by the implicit Python index values. That’s one of the reasons pandas provides its special indexer attributes.
Let’s have a quick look.
# let's get a Series with an explicit integer index
s4 = pd.Series(pop_src['Russia'].values(), index=[1, 3, 5, 7, 9, 2])
print(s4)
print(f"\ns4[3] -> {s4[3]:.3f}")
print(f"\ns4[3:7] ->\n{s4[3:7]}")
print(f"\ns4.loc[3] -> {s4.loc[3]:.3f}")
print(f"\ns4.loc[3:7] ->\n{s4.loc[3:7]}")
# .iloc[] always uses the implicit Python numeric index
print(f"s4.iloc[1] -> {s4.iloc[1]:.3f}")
print(f"\ns4.iloc[1:4] ->\n{s4.iloc[1:4]}")
With respect to DataFrames, columns take precedence over rows. So though in some circumstances we can use array based operations on a DataFrame, that is not true for all cases.
df3 = pops.copy()
display(df3)
# array operations
df3.values
df3.T
df3.values[1]
display(df3)
df3['Brazil']
df3.iloc[:3, :2]
# or similarly using .loc[]
df3.loc[:'2017', :'Brazil']
Just a bit more. First, you may see another indexer attribute, .ix[]
. As of pandas 0.20 it is deprecated. But, it is possible to used mixed indexing with .loc
and .iloc
.
And, there are all sorts of ways to use fancy indexing and/or filtering to select a subset of a pandas object. We’ll look at one example.
# a quick look at mixed indexing
# first with loc
print(df3.loc[df3.index[[2,4]], 'Japan'])
# now with iloc
print('\n', df3.iloc[[2,4], df3.columns.get_loc('Japan')])
# for multiple columns use .get_indexer
display(df3.iloc[[2, 4], df3.columns.get_indexer(['Japan', 'USA'])])
# finally a sample of fancy indexing
display(df3.loc[df3.Australia < 25000.0, 'Australia':'China'])
display(df3.loc[lambda df: df['Australia'] < 25000.0, ['Australia', 'Japan', 'USA']])
Assignment
Most of the above will also allow you to assign new values to the selected elements. Something for you to play with on your own.
Done!
There is a lot I haven’t yet looked at. Don’t know if I will in another post. Or simply wait until it needs to be used. Either way that’s it for this post. Feel free to download my notebook covering the above and play around.