With Python lists and NumPy arrays we have the means to index and slice the objects. With NumPy arrays we also have masking and fancy indexing (I haven’t yet dug into the last two). Only makes sense we can do similar things with pandas Series and DataFrame objects. So let’s have a look.

But it may be beneficial to pay heed to the following quote from the pandas documentation:

The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommended that you take advantage of the optimized pandas data access methods exposed in this chapter.

Indexing and selecting data

Indexing

Pandas provides two basic indexer attributes for indexing into it’s Series and Dataframe objects: .iloc and .loc. These are not object methods as such, rather attributes that provide a specific slicing interface to the pandas objects. The first, .iloc is primarily integer position based — like a NumPy array. .loc is primarily label based — like a dictionary. For a Series this follows the pattern .loc[indexer] and for a DataFrame .loc[row_indexer, column_indexer]. Ditto for .iloc. The null slice can be used for any axis indexer. If an axis specifier is left out it is assumed to be :. So, p.loc['a'] is equivalent to p.loc['a', :].

Let’s first have a look at indexing with []. It is generally used to get lower dimensional slices. So,

series[label] will return a scalar
dataframe[column_name] will return a Series corresponding to the selected column

We can use [] to slice ranges from the objects.

series[label1:label2{:interval}] will return a 1-D slice
dataframe[row_indexer1:row_indexer2{:interval}] will provide a 2-D slice with the specified rows

In [1]:

import numpy as np
import pandas as pd

In [2]:

# let's create some objects
pop_src = {
    'Australia': {'2015': 23932.498999999993, '2016': 24262.71, '2017': 24584.619, '2018': 24898.153000000002, '2019': 25203.199999999997, '2020': 25499.881}, 
    'Brazil': {'2015': 204471.759, '2016': 206163.05599999995, '2017': 207833.825, '2018': 209469.31999999995, '2019': 211049.51900000003, '2020': 212559.40899999999}, 
    'China': {'2015': 1406847.8679999998, '2016': 1414049.353, '2017': 1421021.7940000002, '2018': 1427647.7889999999, '2019': 1433783.6919999998, '2020': 1439323.7739999997}, 
    'Japan': {'2015': 127985.13900000001, '2016': 127763.267, '2017': 127502.72799999999, '2018': 127202.19, '2019': 126860.29899999998, '2020': 126476.458}, 
    'Russia': {'2015': 144985.059, '2016': 145275.37400000004, '2017': 145530.09099999996, '2018': 145734.03399999999, '2019': 145872.26, '2020': 145934.46}, 
    'USA': {'2015': 320878.3120000001, '2016': 323015.99199999997, '2017': 325084.75800000003, '2018': 327096.263, '2019': 329064.917, '2020': 331002.64699999994}
}
pop_china = pd.Series(pop_src['China'])
print(pop_china)
print()
pops = pd.DataFrame(pop_src)
display(pops)

2015    1406847.868
2016    1414049.353
2017    1421021.794
2018    1427647.789
2019    1433783.692
2020    1439323.774
dtype: float64

	Australia	Brazil	China	Japan	Russia	USA
2015	23932.499	204471.759	1406847.868	127985.139	144985.059	320878.312
2016	24262.710	206163.056	1414049.353	127763.267	145275.374	323015.992
2017	24584.619	207833.825	1421021.794	127502.728	145530.091	325084.758
2018	24898.153	209469.320	1427647.789	127202.190	145734.034	327096.263
2019	25203.200	211049.519	1433783.692	126860.299	145872.260	329064.917
2020	25499.881	212559.409	1439323.774	126476.458	145934.460	331002.647

In [3]:

print(f"pop China 2015: {pop_china['2015']}")
print(f"pop China {pop_china.index[0]}: {pop_china[0]}")

pop China 2015: 1406847.8679999998
pop China 2015: 1406847.8679999998

Do note the difference in the output of the next two slices. When using index labels pandas includes the start and end index. When using their equivalent integer positions, that does not happen — as is the case with Python slices.

In [4]:

print(f"pop_china['2017':'2019']:\n")
print(pop_china['2017':'2019'])
print(f"\npop_china[2:4]:\n")
print(pop_china[2:4])

pop_china['2017':'2019']:

2017 1421021.794 2018 1427647.789 2019 1433783.692 dtype: float64

pop_china[2:4]:

2017 1421021.794 2018 1427647.789 dtype: float64

Now a quick look at DataFrames using [ ] notation.

In [5]:


print(f"pops['Japan'] =>\n{pops['Japan']}")

pops['Japan'] =>
2015    127985.139
2016    127763.267
2017    127502.728
2018    127202.190
2019    126860.299
2020    126476.458
Name: Japan, dtype: float64

Do note that a slice interval does not work in this case.

In [6]:

# a slice interval will not work
print(f"pops['Japan':'Russia'] =>\n{pops['Japan':'Russia']}")
# but passing a list of columns will work
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")

pops['Japan':'Russia'] =>
Empty DataFrame
Columns: [Australia, Brazil, China, Japan, Russia, USA]
Index: []
pops[['Japan', 'Russia']] =>
Japan      Russia
2015  127985.139  144985.059
2016  127763.267  145275.374
2017  127502.728  145530.091
2018  127202.190  145734.034
2019  126860.299  145872.260
2020  126476.458  145934.460

Assigning values also works.

In [7]:

# we can even use that list mechanism to swap columns
pops[['Japan', 'Russia']] = pops[['Russia', 'Japan', ]]
print(f"\npops[['Japan', 'Russia']] =>\n{pops[['Japan', 'Russia']]}")

pops[['Japan', 'Russia']] =>
           Japan      Russia
2015  144985.059  127985.139
2016  145275.374  127763.267
2017  145530.091  127502.728
2018  145734.034  127202.190
2019  145872.260  126860.299
2020  145934.460  126476.458

Attribute access, in the sense of an object attribute, can also be used.

In [8]:

# now let's have a look at attribute access
# have to change the index to true strings for the following to work
s2 = pd.Series(pop_src['USA'].values(), index=list('abcdef'))
print(s2)
print()
df2 = pops.copy()
df2[['Japan', 'Russia']] = df2[['Russia', 'Japan', ]]
display(df2)

a    320878.312
b    323015.992
c    325084.758
d    327096.263
e    329064.917
f    331002.647
dtype: float64

	Australia	Brazil	China	Japan	Russia	USA
2015	23932.499	204471.759	1406847.868	127985.139	144985.059	320878.312
2016	24262.710	206163.056	1414049.353	127763.267	145275.374	323015.992
2017	24584.619	207833.825	1421021.794	127502.728	145530.091	325084.758
2018	24898.153	209469.320	1427647.789	127202.190	145734.034	327096.263
2019	25203.200	211049.519	1433783.692	126860.299	145872.260	329064.917
2020	25499.881	212559.409	1439323.774	126476.458	145934.460	331002.647

In [9]:

print(f"s2.d -> {s2.d}")
print(f"\ndf2.Brazil ->\n{df2.Brazil}")

s2.d -> 327096.263

df2.Brazil -> 2015 204471.759 2016 206163.056 2017 207833.825 2018 209469.320 2019 211049.519 2020 212559.409 Name: Brazil, dtype: float64

In [10]:

# the reason for changing s2's index is that you can only use attribute access if the index element is a valid Python identifier
s3 = df2.Brazil.copy()
print(s3)

2015    204471.759
2016    206163.056
2017    207833.825
2018    209469.320
2019    211049.519
2020    212559.409
Name: Brazil, dtype: float64

In [11]:

# 2019 not a valid identifier, but s3['2019'] will still work
print(s3.2019)

  File "<ipython-input-11-25e04c7bd131>", line 2
    print(s3.2019)
            ^
SyntaxError: invalid syntax

In [12]:

# and we can also assign values using this syntax
s2.d = 9.99
print(s2)

a    320878.312
b    323015.992
c    325084.758
d         9.990
e    329064.917
f    331002.647
dtype: float64

In [13]:

df2.Brazil = list(range(len(df2.index)))  # ok if column Brazil already exists
display(df2)
df2['ZZ'] = list(range(len(df2.index)))  # use this form to create a new column
display(df2)

	Australia	Brazil	China	Japan	Russia	USA
2015	23932.499	0	1406847.868	127985.139	144985.059	320878.312
2016	24262.710	1	1414049.353	127763.267	145275.374	323015.992
2017	24584.619	2	1421021.794	127502.728	145530.091	325084.758
2018	24898.153	3	1427647.789	127202.190	145734.034	327096.263
2019	25203.200	4	1433783.692	126860.299	145872.260	329064.917
2020	25499.881	5	1439323.774	126476.458	145934.460	331002.647

	Australia	Brazil	China	Japan	Russia	USA	ZZ
2015	23932.499	0	1406847.868	127985.139	144985.059	320878.312	0
2016	24262.710	1	1414049.353	127763.267	145275.374	323015.992	1
2017	24584.619	2	1421021.794	127502.728	145530.091	325084.758	2
2018	24898.153	3	1427647.789	127202.190	145734.034	327096.263	3
2019	25203.200	4	1433783.692	126860.299	145872.260	329064.917	4
2020	25499.881	5	1439323.774	126476.458	145934.460	331002.647	5

Now, onto those pandas indexer attributes. (Post getting lengthy, so this may be less detailed than I originally planned.)

Do note that the slicing and indexing conventions can be a source of confusion. If you have an explicit integer index, then series[1] will return the item with the index 1. However the slice, series[1:3] will return the slice identified by the implicit Python index values. That’s one of the reasons pandas provides its special indexer attributes.

Let’s have a quick look.

In [14]:

# let's get a Series with an explicit integer index
s4 = pd.Series(pop_src['Russia'].values(), index=[1, 3, 5, 7, 9, 2])
print(s4)
print(f"\ns4[3] -> {s4[3]:.3f}")
print(f"\ns4[3:7] ->\n{s4[3:7]}")
print(f"\ns4.loc[3] -> {s4.loc[3]:.3f}")
print(f"\ns4.loc[3:7] ->\n{s4.loc[3:7]}")

1    144985.059
3    145275.374
5    145530.091
7    145734.034
9    145872.260
2    145934.460
dtype: float64
s4[3] -> 145275.374
s4[3:7] ->
7    145734.034
9    145872.260
2    145934.460
dtype: float64
s4.loc[3] -> 145275.374
s4.loc[3:7] ->
3    145275.374
5    145530.091
7    145734.034
dtype: float64

In [15]:

# .iloc[] always uses the implicit Python numeric index
print(f"s4.iloc[1] -> {s4.iloc[1]:.3f}")
print(f"\ns4.iloc[1:4] ->\n{s4.iloc[1:4]}")

s4.iloc[1] -> 145275.374
s4.iloc[1:4] ->
3    145275.374
5    145530.091
7    145734.034
dtype: float64

With respect to DataFrames, columns take precedence over rows. So though in some circumstances we can use array based operations on a DataFrame, that is not true for all cases.

In [16]:

df3 = pops.copy()
display(df3)
# array operations
df3.values

	Australia	Brazil	China	Japan	Russia	USA
2015	23932.499	204471.759	1406847.868	144985.059	127985.139	320878.312
2016	24262.710	206163.056	1414049.353	145275.374	127763.267	323015.992
2017	24584.619	207833.825	1421021.794	145530.091	127502.728	325084.758
2018	24898.153	209469.320	1427647.789	145734.034	127202.190	327096.263
2019	25203.200	211049.519	1433783.692	145872.260	126860.299	329064.917
2020	25499.881	212559.409	1439323.774	145934.460	126476.458	331002.647

Out[16]:

array([[  23932.499,  204471.759, 1406847.868,  144985.059,  127985.139,
         320878.312],
       [  24262.71 ,  206163.056, 1414049.353,  145275.374,  127763.267,
         323015.992],
       [  24584.619,  207833.825, 1421021.794,  145530.091,  127502.728,
         325084.758],
       [  24898.153,  209469.32 , 1427647.789,  145734.034,  127202.19 ,
         327096.263],
       [  25203.2  ,  211049.519, 1433783.692,  145872.26 ,  126860.299,
         329064.917],
       [  25499.881,  212559.409, 1439323.774,  145934.46 ,  126476.458,
         331002.647]])

In [17]:

df3.T

Out[17]:

	2015	2016	2017	2018	2019	2020
Australia	23932.499	24262.710	24584.619	24898.153	25203.200	25499.881
Brazil	204471.759	206163.056	207833.825	209469.320	211049.519	212559.409
China	1406847.868	1414049.353	1421021.794	1427647.789	1433783.692	1439323.774
Japan	144985.059	145275.374	145530.091	145734.034	145872.260	145934.460
Russia	127985.139	127763.267	127502.728	127202.190	126860.299	126476.458
USA	320878.312	323015.992	325084.758	327096.263	329064.917	331002.647

In [18]:

df3.values[1]

Out[18]:

array([  24262.71 ,  206163.056, 1414049.353,  145275.374,  127763.267,
        323015.992])

In [19]:

display(df3)
df3['Brazil']

	Australia	Brazil	China	Japan	Russia	USA
2015	23932.499	204471.759	1406847.868	144985.059	127985.139	320878.312
2016	24262.710	206163.056	1414049.353	145275.374	127763.267	323015.992
2017	24584.619	207833.825	1421021.794	145530.091	127502.728	325084.758
2018	24898.153	209469.320	1427647.789	145734.034	127202.190	327096.263
2019	25203.200	211049.519	1433783.692	145872.260	126860.299	329064.917
2020	25499.881	212559.409	1439323.774	145934.460	126476.458	331002.647

Out[19]:

2015    204471.759
2016    206163.056
2017    207833.825
2018    209469.320
2019    211049.519
2020    212559.409
Name: Brazil, dtype: float64

In [20]:

df3.iloc[:3, :2]

Out[20]:

	Australia	Brazil
2015	23932.499	204471.759
2016	24262.710	206163.056
2017	24584.619	207833.825

In [21]:

# or similarly using .loc[]
df3.loc[:'2017', :'Brazil']

Out[21]:

	Australia	Brazil
2015	23932.499	204471.759
2016	24262.710	206163.056
2017	24584.619	207833.825

Just a bit more. First, you may see another indexer attribute, .ix[]. As of pandas 0.20 it is deprecated. But, it is possible to used mixed indexing with .loc and .iloc.

And, there are all sorts of ways to use fancy indexing and/or filtering to select a subset of a pandas object. We’ll look at one example.

In [22]:

# a quick look at mixed indexing
# first with loc
print(df3.loc[df3.index[[2,4]], 'Japan'])
# now with iloc
print('\n', df3.iloc[[2,4], df3.columns.get_loc('Japan')])
# for multiple columns use .get_indexer
display(df3.iloc[[2, 4], df3.columns.get_indexer(['Japan', 'USA'])])

2017 145530.091 2019 145872.260 Name: Japan, dtype: float64

	Japan	USA
2017	145530.091	325084.758
2019	145872.260	329064.917

In [23]:

# finally a sample of fancy indexing
display(df3.loc[df3.Australia < 25000.0, 'Australia':'China'])
display(df3.loc[lambda df: df['Australia'] < 25000.0, ['Australia', 'Japan', 'USA']])

	Australia	Brazil	China
2015	23932.499	204471.759	1406847.868
2016	24262.710	206163.056	1414049.353
2017	24584.619	207833.825	1421021.794
2018	24898.153	209469.320	1427647.789

	Australia	Japan	USA
2015	23932.499	144985.059	320878.312
2016	24262.710	145275.374	323015.992
2017	24584.619	145530.091	325084.758
2018	24898.153	145734.034	327096.263

Assignment

Most of the above will also allow you to assign new values to the selected elements. Something for you to play with on your own.

Done!

There is a lot I haven’t yet looked at. Don’t know if I will in another post. Or simply wait until it needs to be used. Either way that’s it for this post. Feel free to download my notebook covering the above and play around.

Too Old To Code

Data Science Basics: pandas — Accessing/Modifying Pandas Objects

Indexing

Assignment

Done!

Resources