With Python lists and NumPy arrays we have the means to index and slice the objects. With NumPy arrays we also have masking and fancy indexing (I haven’t yet dug into the last two). Only makes sense we can do similar things with pandas Series and DataFrame objects. So let’s have a look.

But it may be beneficial to pay heed to the following quote from the pandas documentation:

The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommended that you take advantage of the optimized pandas data access methods exposed in this chapter.

Indexing and selecting data

Indexing

Pandas provides two basic indexer attributes for indexing into it’s Series and Dataframe objects: .iloc and .loc. These are not object methods as such, rather attributes that provide a specific slicing interface to the pandas objects. The first, .iloc is primarily integer position based — like a NumPy array. .loc is primarily label based — like a dictionary. For a Series this follows the pattern .loc[indexer] and for a DataFrame .loc[row_indexer, column_indexer]. Ditto for .iloc. The null slice can be used for any axis indexer. If an axis specifier is left out it is assumed to be :. So, p.loc['a'] is equivalent to p.loc['a', :].

Let’s first have a look at indexing with []. It is generally used to get lower dimensional slices. So,

  • series[label] will return a scalar
  • dataframe[column_name] will return a Series corresponding to the selected column

We can use [] to slice ranges from the objects.

  • series[label1:label2{:interval}] will return a 1-D slice
  • dataframe[row_indexer1:row_indexer2{:interval}] will provide a 2-D slice with the specified rows
In [1]:
import numpy as np
import pandas as pd
In [2]:
# let's create some objects
pop_src = {
    'Australia': {'2015': 23932.498999999993, '2016': 24262.71, '2017': 24584.619, '2018': 24898.153000000002, '2019': 25203.199999999997, '2020': 25499.881}, 
    'Brazil': {'2015': 204471.759, '2016': 206163.05599999995, '2017': 207833.825, '2018': 209469.31999999995, '2019': 211049.51900000003, '2020': 212559.40899999999}, 
    'China': {'2015': 1406847.8679999998, '2016': 1414049.353, '2017': 1421021.7940000002, '2018': 1427647.7889999999, '2019': 1433783.6919999998, '2020': 1439323.7739999997}, 
    'Japan': {'2015': 127985.13900000001, '2016': 127763.267, '2017': 127502.72799999999, '2018': 127202.19, '2019': 126860.29899999998, '2020': 126476.458}, 
    'Russia': {'2015': 144985.059, '2016': 145275.37400000004, '2017': 145530.09099999996, '2018': 145734.03399999999, '2019': 145872.26, '2020': 145934.46}, 
    'USA': {'2015': 320878.3120000001, '2016': 323015.99199999997, '2017': 325084.75800000003, '2018': 327096.263, '2019': 329064.917, '2020': 331002.64699999994}
}
pop_china = pd.Series(pop_src['China'])
print(pop_china)
print()
pops = pd.DataFrame(pop_src)
display(pops)
2015    1406847.868
2016    1414049.353
2017    1421021.794
2018    1427647.789
2019    1433783.692
2020    1439323.774
dtype: float64
AustraliaBrazilChinaJapanRussiaUSA
201523932.499204471.7591406847.868127985.139144985.059320878.312
201624262.710206163.0561414049.353127763.267145275.374323015.992
201724584.619207833.8251421021.794127502.728145530.091325084.758
201824898.153209469.3201427647.789127202.190145734.034327096.263
201925203.200211049.5191433783.692126860.299145872.260329064.917
202025499.881212559.4091439323.774126476.458145934.460331002.647
In [3]:
print(f"pop China 2015: {pop_china['2015']}")
print(f"pop China {pop_china.index[0]}: {pop_china[0]}")
pop China 2015: 1406847.8679999998
pop China 2015: 1406847.8679999998

Do note the difference in the output of the next two slices. When using index labels pandas includes the start and end index. When using their equivalent integer positions, that does not happen — as is the case with Python slices.

In [4]:
print(f"pop_china['2017':'2019']:\n")
print(pop_china['2017':'2019'])
print(f"\npop_china[2:4]:\n")
print(pop_china[2:4])
pop_china['2017':'2019']:

2017 1421021.794 2018 1427647.789 2019 1433783.692 dtype: float64

pop_china[2:4]:

2017 1421021.794 2018 1427647.789 dtype: float64

Now a quick look at DataFrames using [ ] notation.

In [5]:

print(f"pops['Japan'] =>\n{pops['Japan']}")
pops['Japan'] =>
2015    127985.139
2016    127763.267
2017    127502.728
2018    127202.190
2019    126860.299
2020    126476.458
Name: Japan, dtype: float64

Do note that a slice interval does not work in this case.

In [6]:
# a slice interval will not work
print(f"pops['Japan':'Russia'] =>\n{pops['Japan':'Russia']}")
# but passing a list of columns will work
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")
pops['Japan':'Russia'] =>
Empty DataFrame
Columns: [Australia, Brazil, China, Japan, Russia, USA]
Index: []

pops[['Japan', 'Russia']] => Japan Russia 2015 127985.139 144985.059 2016 127763.267 145275.374 2017 127502.728 145530.091 2018 127202.190 145734.034 2019 126860.299 145872.260 2020 126476.458 145934.460

Assigning values also works.

In [7]:
# we can even use that list mechanism to swap columns
pops[['Japan', 'Russia']] = pops[['Russia', 'Japan', ]]
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")
pops[['Japan', 'Russia']] =>
           Japan      Russia
2015  144985.059  127985.139
2016  145275.374  127763.267
2017  145530.091  127502.728
2018  145734.034  127202.190
2019  145872.260  126860.299
2020  145934.460  126476.458

Attribute access, in the sense of an object attribute, can also be used.

In [8]:
# now let's have a look at attribute access
# have to change the index to true strings for the following to work
s2 = pd.Series(pop_src['USA'].values(), index=list('abcdef'))
print(s2)
print()
df2 = pops.copy()
df2[['Japan', 'Russia']] = df2[['Russia', 'Japan', ]]
display(df2)
a    320878.312
b    323015.992
c    325084.758
d    327096.263
e    329064.917
f    331002.647
dtype: float64
AustraliaBrazilChinaJapanRussiaUSA
201523932.499204471.7591406847.868127985.139144985.059320878.312
201624262.710206163.0561414049.353127763.267145275.374323015.992
201724584.619207833.8251421021.794127502.728145530.091325084.758
201824898.153209469.3201427647.789127202.190145734.034327096.263
201925203.200211049.5191433783.692126860.299145872.260329064.917
202025499.881212559.4091439323.774126476.458145934.460331002.647
In [9]:
print(f"s2.d -> {s2.d}")
print(f"\ndf2.Brazil ->\n{df2.Brazil}")
s2.d -> 327096.263

df2.Brazil -> 2015 204471.759 2016 206163.056 2017 207833.825 2018 209469.320 2019 211049.519 2020 212559.409 Name: Brazil, dtype: float64

In [10]:
# the reason for changing s2's index is that you can only use attribute access if the index element is a valid Python identifier
s3 = df2.Brazil.copy()
print(s3)
2015    204471.759
2016    206163.056
2017    207833.825
2018    209469.320
2019    211049.519
2020    212559.409
Name: Brazil, dtype: float64
In [11]:
# 2019 not a valid identifier, but s3['2019'] will still work
print(s3.2019)
  File "<ipython-input-11-25e04c7bd131>", line 2
    print(s3.2019)
            ^
SyntaxError: invalid syntax
In [12]:
# and we can also assign values using this syntax
s2.d = 9.99
print(s2)
a    320878.312
b    323015.992
c    325084.758
d         9.990
e    329064.917
f    331002.647
dtype: float64
In [13]:
df2.Brazil = list(range(len(df2.index)))  # ok if column Brazil already exists
display(df2)
df2['ZZ'] = list(range(len(df2.index)))  # use this form to create a new column
display(df2)
AustraliaBrazilChinaJapanRussiaUSA
201523932.49901406847.868127985.139144985.059320878.312
201624262.71011414049.353127763.267145275.374323015.992
201724584.61921421021.794127502.728145530.091325084.758
201824898.15331427647.789127202.190145734.034327096.263
201925203.20041433783.692126860.299145872.260329064.917
202025499.88151439323.774126476.458145934.460331002.647
AustraliaBrazilChinaJapanRussiaUSAZZ
201523932.49901406847.868127985.139144985.059320878.3120
201624262.71011414049.353127763.267145275.374323015.9921
201724584.61921421021.794127502.728145530.091325084.7582
201824898.15331427647.789127202.190145734.034327096.2633
201925203.20041433783.692126860.299145872.260329064.9174
202025499.88151439323.774126476.458145934.460331002.6475

Now, onto those pandas indexer attributes. (Post getting lengthy, so this may be less detailed than I originally planned.)

Do note that the slicing and indexing conventions can be a source of confusion. If you have an explicit integer index, then series[1] will return the item with the index 1. However the slice, series[1:3] will return the slice identified by the implicit Python index values. That’s one of the reasons pandas provides its special indexer attributes.

Let’s have a quick look.

In [14]:
# let's get a Series with an explicit integer index
s4 = pd.Series(pop_src['Russia'].values(), index=[1, 3, 5, 7, 9, 2])
print(s4)
print(f"\ns4[3] -> {s4[3]:.3f}")
print(f"\ns4[3:7] ->\n{s4[3:7]}")
print(f"\ns4.loc[3] -> {s4.loc[3]:.3f}")
print(f"\ns4.loc[3:7] ->\n{s4.loc[3:7]}")
1    144985.059
3    145275.374
5    145530.091
7    145734.034
9    145872.260
2    145934.460
dtype: float64

s4[3] -> 145275.374

s4[3:7] -> 7 145734.034 9 145872.260 2 145934.460 dtype: float64

s4.loc[3] -> 145275.374

s4.loc[3:7] -> 3 145275.374 5 145530.091 7 145734.034 dtype: float64

In [15]:
# .iloc[] always uses the implicit Python numeric index
print(f"s4.iloc[1] -> {s4.iloc[1]:.3f}")
print(f"\ns4.iloc[1:4] ->\n{s4.iloc[1:4]}")
s4.iloc[1] -> 145275.374

s4.iloc[1:4] -> 3 145275.374 5 145530.091 7 145734.034 dtype: float64

With respect to DataFrames, columns take precedence over rows. So though in some circumstances we can use array based operations on a DataFrame, that is not true for all cases.

In [16]:
df3 = pops.copy()
display(df3)
# array operations
df3.values
AustraliaBrazilChinaJapanRussiaUSA
201523932.499204471.7591406847.868144985.059127985.139320878.312
201624262.710206163.0561414049.353145275.374127763.267323015.992
201724584.619207833.8251421021.794145530.091127502.728325084.758
201824898.153209469.3201427647.789145734.034127202.190327096.263
201925203.200211049.5191433783.692145872.260126860.299329064.917
202025499.881212559.4091439323.774145934.460126476.458331002.647
Out[16]:
array([[  23932.499,  204471.759, 1406847.868,  144985.059,  127985.139,
         320878.312],
       [  24262.71 ,  206163.056, 1414049.353,  145275.374,  127763.267,
         323015.992],
       [  24584.619,  207833.825, 1421021.794,  145530.091,  127502.728,
         325084.758],
       [  24898.153,  209469.32 , 1427647.789,  145734.034,  127202.19 ,
         327096.263],
       [  25203.2  ,  211049.519, 1433783.692,  145872.26 ,  126860.299,
         329064.917],
       [  25499.881,  212559.409, 1439323.774,  145934.46 ,  126476.458,
         331002.647]])
In [17]:
df3.T
Out[17]:
201520162017201820192020
Australia23932.49924262.71024584.61924898.15325203.20025499.881
Brazil204471.759206163.056207833.825209469.320211049.519212559.409
China1406847.8681414049.3531421021.7941427647.7891433783.6921439323.774
Japan144985.059145275.374145530.091145734.034145872.260145934.460
Russia127985.139127763.267127502.728127202.190126860.299126476.458
USA320878.312323015.992325084.758327096.263329064.917331002.647
In [18]:
df3.values[1]
Out[18]:
array([  24262.71 ,  206163.056, 1414049.353,  145275.374,  127763.267,
        323015.992])
In [19]:
display(df3)
df3['Brazil']
AustraliaBrazilChinaJapanRussiaUSA
201523932.499204471.7591406847.868144985.059127985.139320878.312
201624262.710206163.0561414049.353145275.374127763.267323015.992
201724584.619207833.8251421021.794145530.091127502.728325084.758
201824898.153209469.3201427647.789145734.034127202.190327096.263
201925203.200211049.5191433783.692145872.260126860.299329064.917
202025499.881212559.4091439323.774145934.460126476.458331002.647
Out[19]:
2015    204471.759
2016    206163.056
2017    207833.825
2018    209469.320
2019    211049.519
2020    212559.409
Name: Brazil, dtype: float64
In [20]:
df3.iloc[:3, :2]
Out[20]:
AustraliaBrazil
201523932.499204471.759
201624262.710206163.056
201724584.619207833.825
In [21]:
# or similarly using .loc[]
df3.loc[:'2017', :'Brazil']
Out[21]:
AustraliaBrazil
201523932.499204471.759
201624262.710206163.056
201724584.619207833.825

Just a bit more. First, you may see another indexer attribute, .ix[]. As of pandas 0.20 it is deprecated. But, it is possible to used mixed indexing with .loc and .iloc.

And, there are all sorts of ways to use fancy indexing and/or filtering to select a subset of a pandas object. We’ll look at one example.

In [22]:
# a quick look at mixed indexing
# first with loc
print(df3.loc[df3.index[[2,4]], 'Japan'])
# now with iloc
print('\n', df3.iloc[[2,4], df3.columns.get_loc('Japan')])
# for multiple columns use .get_indexer
display(df3.iloc[[2, 4], df3.columns.get_indexer(['Japan', 'USA'])])
2017    145530.091
2019    145872.260
Name: Japan, dtype: float64

2017 145530.091 2019 145872.260 Name: Japan, dtype: float64

JapanUSA
2017145530.091325084.758
2019145872.260329064.917
In [23]:
# finally a sample of fancy indexing
display(df3.loc[df3.Australia < 25000.0, 'Australia':'China'])
display(df3.loc[lambda df: df['Australia'] < 25000.0, ['Australia', 'Japan', 'USA']])
AustraliaBrazilChina
201523932.499204471.7591406847.868
201624262.710206163.0561414049.353
201724584.619207833.8251421021.794
201824898.153209469.3201427647.789
AustraliaJapanUSA
201523932.499144985.059320878.312
201624262.710145275.374323015.992
201724584.619145530.091325084.758
201824898.153145734.034327096.263

Assignment

Most of the above will also allow you to assign new values to the selected elements. Something for you to play with on your own.

Done!

There is a lot I haven’t yet looked at. Don’t know if I will in another post. Or simply wait until it needs to be used. Either way that’s it for this post. Feel free to download my notebook covering the above and play around.

Resources