Data Indexing and Selection

In this section we'll look at the means of accessing and modifying values in Pandas Series and DataFrame objects, which closely mirror the patterns used for NumPy arrays.

We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

In [ ]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[ ]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
In [ ]:
data['b']
Out[ ]:
0.5

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [ ]:
'a' in data
Out[ ]:
True
In [ ]:
data.keys()
Out[ ]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [ ]:
list(data.items())
Out[ ]:
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [ ]:
data['e'] = 1.25
data
Out[ ]:
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays. Examples of these are as follows:

In [ ]:
# slicing by explicit index
data['a':'c']
Out[ ]:
a    0.25
b    0.50
c    0.75
dtype: float64
In [ ]:
# slicing by implicit integer index
data[0:2]
Out[ ]:
a    0.25
b    0.50
dtype: float64

Here, slicing may be a source of confusion. Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

Indexers: loc and iloc

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

A Pandas Series can be queried either by index position or by index label. As we saw, if you don't give an index to the series, the position and the label are effectively the same. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, use the loc attribute.
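Applied to the Series we built earlier, the two indexers make the slicing rules explicit (a minimal sketch; the Series is rebuilt here so the example is self-contained):

```python
import pandas as pd

# Rebuild the Series from earlier so this cell stands alone
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# loc: explicit, label-based indexing; slices INCLUDE the final label
print(data.loc['a':'c'])   # rows a, b, c

# iloc: implicit, position-based indexing; slices EXCLUDE the endpoint
print(data.iloc[0:2])      # rows a, b
```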

In [ ]:
sports = {'Football': 'Germany',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Out[ ]:
Football         Germany
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

Access an element by integer position (implicit)

In [ ]:
s.iloc[3]
Out[ ]:
'South Korea'

Access an element by label (explicit)

In [ ]:
s.loc['Golf']
Out[ ]:
'Scotland'

Keep in mind that iloc and loc are not methods, they are attributes. So you don't use parentheses to query them, but square brackets instead, which we'll call the indexing operator.

If you pass in an integer parameter, the operator will behave as if you want to query via the iloc attribute. (Be aware that in recent versions of pandas this positional fallback is deprecated, so it is safer to use iloc explicitly.)

In [ ]:
s[3]
Out[ ]:
'South Korea'

If you pass in an object, it will query as if you wanted to use the label based loc attribute.

In [ ]:
s['Golf']
Out[ ]:
'Scotland'

Here's an example using some new sports data, where the series is indexed by integers.

In [ ]:
sports = {99: 'Germany',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s
Out[ ]:
99         Germany
100       Scotland
101          Japan
102    South Korea
dtype: object

If we try to call s[0], we get a KeyError:

In [ ]:
s[0] # raises a KeyError, because there's no item in the series with a label of zero
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 0
In [ ]:
s.iloc[0] # Instead we have to call iloc explicitly if we want the first item.
Out[ ]:
'Germany'
In [ ]:
s[99] # the first entry has the index of 99
Out[ ]:
'Germany'

One last note on using the indexing operators to access series data. The .loc attribute lets you not only modify data in place, but also add new data as well.

Pandas will automatically change the underlying NumPy types as appropriate.

In [ ]:
s = pd.Series([1, 2, 3])
print(s)
s.loc['Animal'] = 'Dog'
print(s)
0    1
1    2
2    3
dtype: int64
0           1
1           2
2           3
Animal    Dog
dtype: object

Let's also look at an example where index values are not unique.

If we have a look at countries and their national sports, it turns out that many countries seem to like the game of cricket. Let's define the following Series.

In [ ]:
national_sports = pd.Series(['Australia',
                             'Barbados',
                             'Pakistan',
                             'England',
                             'Germany'],
                            index=['Cricket',
                                   'Cricket',
                                   'Cricket',
                                   'Cricket',
                                   'Football'])

Printing the Series returns

In [ ]:
national_sports
Out[ ]:
Cricket     Australia
Cricket      Barbados
Cricket      Pakistan
Cricket       England
Football      Germany
dtype: object

Finally, we see that when we query the series for the countries whose national sport is cricket, we don't get a single value back, but a Series.

In [ ]:
national_sports.loc['Cricket']
Out[ ]:
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let's return to our example of areas and populations of states:

In [ ]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
Out[ ]:
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [ ]:
data['area']
Out[ ]:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [ ]:
data.area
Out[ ]:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Though this is a useful shorthand, keep in mind that it does not work for all cases! For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:

In [ ]:
data.pop is data['pop']
Out[ ]:
False

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
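For instance, using the state data above, we might derive a new density column from the existing ones (a small sketch; the frame is rebuilt with two states so the example runs standalone):

```python
import pandas as pd

# Rebuild a reduced version of the area/pop frame from above
area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# Dictionary-style assignment creates a new column from existing ones
data['density'] = data['pop'] / data['area']
print(data['density'])
```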

DataFrame as a two-dimensional array

Similar to the series, we can extract data using the iloc and loc attributes. This will be considered below.

Accessing Rows

Let's create the following DataFrame of purchases made in different stores by different people.

In [ ]:
purchase_1 = pd.Series({'Name': 'Matthias',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})


df = pd.DataFrame(data = [purchase_1, purchase_2, purchase_3], index=["Store 1", "Store 2", "Store 3"])
df
Out[ ]:
              Name Item Purchased  Cost
Store 1   Matthias       Dog Food  22.5
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0

Let's select the data associated with Store 1. Rows can be accessed by their integer position using .iloc.

Be aware that index 0 indicates the first element (zero-based indexing).

In [ ]:
df.iloc[0]
Out[ ]:
Name              Matthias
Item Purchased    Dog Food
Cost                  22.5
Name: Store 1, dtype: object

Because the DataFrame is two-dimensional, passing a single integer to the iloc indexer returns the corresponding row as a Series.

Of course we can also use the .loc operator to access rows by the name of the label.

In [ ]:
df.loc['Store 1']
Out[ ]:
Name              Matthias
Item Purchased    Dog Food
Cost                  22.5
Name: Store 1, dtype: object

We can check the data type of the returned object using the Python type function.

In [ ]:
type(df.loc['Store 2'])
Out[ ]:
pandas.core.series.Series

Accessing Multiple Rows

We can also access multiple rows. Here we just pass a list of the desired rows to our .loc or .iloc operator.

In [ ]:
df.loc[["Store 1", "Store 2"]]
Out[ ]:
             Name Item Purchased  Cost
Store 1  Matthias       Dog Food  22.5
Store 2    Thomas   Kitty Litter   2.5
In [ ]:
df.iloc[[0, 1]]
Out[ ]:
             Name Item Purchased  Cost
Store 1  Matthias       Dog Food  22.5
Store 2    Thomas   Kitty Litter   2.5

Column Selection

What if we wanted to access specific columns instead? Let's say we want to get a list of all the costs.

1. Indexing Operator

In a Pandas DataFrame, columns always have a name; therefore we can simply use the indexing operator with the column name to access a column:

In [ ]:
df['Cost']
Out[ ]:
Store 1    22.5
Store 2     2.5
Store 3     5.0
Name: Cost, dtype: float64

2. Full slice with loc

.loc and .iloc can take two parameters: a row indexer and a column indexer. Both also support slicing.

If we want to select all rows, we can use the colon operator : to indicate a full slice from beginning to end. To select a specific column, we then add the column name as the second parameter.

In [ ]:
df.loc[:,["Cost"]]
Out[ ]:
         Cost
Store 1  22.5
Store 2   2.5
Store 3   5.0

In fact, if we wanted to include multiple columns, we could do so in a list.

In [ ]:
a = df.loc[:,['Name', 'Cost']]
a
Out[ ]:
              Name  Cost
Store 1   Matthias  22.5
Store 2     Thomas   2.5
Store 3  Christina   5.0

This time, however, we do not get a Series back but another DataFrame.

In [ ]:
type(a)
Out[ ]:
pandas.core.frame.DataFrame

Accessing both Rows and Columns

As you may have guessed, .iloc and .loc let you access rows and columns at the same time. Simply follow the syntax:

df.iloc[[list of rows], [list of columns]]

Make sure to not mix iloc and loc.

In [ ]:
df.loc[["Store 1", "Store 3"], ["Name"]]
Out[ ]:
              Name
Store 1   Matthias
Store 3  Christina
In [ ]:
df.iloc[[0,2],[0]]
Out[ ]:
              Name
Store 1   Matthias
Store 3  Christina

Querying Data via Boolean Masks

Boolean masking is the heart of fast and efficient querying in NumPy.

A Boolean mask is an array, which can be one-dimensional like a Series or two-dimensional like a data frame, where each value in the array is either True or False. This array is essentially overlaid on top of the data structure that we're querying: any cell aligned with a True value is admitted into our final result, and any cell aligned with a False value is not.

Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects.

For instance, in our store data set, you might be interested in seeing only those customers who bought a product with a cost lower than 10. To build a Boolean mask for this query, we select the Cost column using the indexing operator and apply the less-than operator with a comparison value of 10. This essentially broadcasts the comparison across the column, with the results returned as a Boolean Series.

In [ ]:
df['Cost'] < 10
Out[ ]:
Store 1    False
Store 2     True
Store 3     True
Name: Cost, dtype: bool

The resulting Series keeps the original index, and the value of each cell is either True or False depending on whether the purchase cost less than 10.

So this builds us the Boolean mask, which is half the battle. What we want to do next is overlay that mask on the data frame.

Where function

We can do this using the where function. The where function takes a Boolean mask as a condition, applies it to the data frame or Series, and returns a new data frame or Series of the same shape. Let's apply this Boolean mask to our data and create a data frame of only those purchases with a cost lower than 10.

In [ ]:
lower_ten = df.where(df['Cost'] < 10)
lower_ten
Out[ ]:
              Name Item Purchased  Cost
Store 1        NaN            NaN   NaN
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0

We see that the resulting data frame keeps the original index values, and only the rows that met the condition retain their data. All of the rows which did not meet the condition have NaN values instead. This is okay: most statistical functions built into the data frame object ignore NaN values.

Count function

For instance, if we call count() on the Cost column of the lower_ten data frame:

In [ ]:
lower_ten['Cost'].count()
Out[ ]:
2

we see that there are 2 purchases which were below ten cost units.

If we call count on the original data frame, we see that there are 3 purchases total.

In [ ]:
df['Cost'].count()
Out[ ]:
3

Indexing Operator with Boolean Mask

We do not actually have to use the where function explicitly. The pandas developers allow the indexing operator to take a Boolean mask as a value instead of just a list of column names.

The syntax looks as follows:

df[df['Column'] <Boolean condition>]

Here's a more concise example of how we could query this data frame. You'll notice that there are no NaNs when you query the data frame in this manner: rather than masking values out, pandas simply drops the rows where the condition is False. Note that .loc accepts a Boolean mask as well, which is what the next cell uses.

In [ ]:
lower_ten = df.loc[df['Cost'] < 10]
lower_ten
Out[ ]:
              Name Item Purchased  Cost
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0

The syntax might look a little messy, especially if you're not used to programming languages with overloaded operators, but the result is that you're able to filter and reduce data frames relatively quickly.

Chain Statements

One more thing to keep in mind if you're not used to Boolean or bit masking for data reduction. The output of two Boolean masks being compared with logical operators is another Boolean mask. This means that you can chain together a bunch of and/or statements in order to create more complex queries, and the result is a single Boolean mask.

For instance, we could create a mask for all of those purchases that cost less than 10 cost units and were made by Thomas.

In [ ]:
cheap_thomas = df[(df['Cost'] < 10) & (df['Name'] == "Thomas")]
cheap_thomas
Out[ ]:
           Name Item Purchased  Cost
Store 2  Thomas   Kitty Litter   2.5

Extremely important, and often an issue for new users, is to remember that each Boolean mask needs to be enclosed in parentheses because of operator precedence. This can cause no end of frustration if you're not used to it, so be careful.
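A quick sketch of why the parentheses matter: in Python, & binds more tightly than comparison operators, so without parentheses the expression is grouped in a way you almost certainly did not intend.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                   'Cost': [22.5, 2.5, 5.0]})

# Correct: each comparison is evaluated first, then combined with &
mask = (df['Cost'] < 10) & (df['Name'] == 'Thomas')
print(mask.tolist())  # [False, True, False]

# Without parentheses, df['Cost'] < 10 & df['Name'] == 'Thomas' is parsed
# as df['Cost'] < (10 & df['Name']) == 'Thomas', which raises an error
# rather than building the mask you wanted.
```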

Additional indexing conventions

There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:

In [ ]:
df['Store 2':'Store 3']
Out[ ]:
              Name Item Purchased  Cost
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0

Such slices can also refer to rows by number rather than by index:

In [ ]:
df[1:3]
Out[ ]:
              Name Item Purchased  Cost
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [ ]:
df[df["Cost"] < 10]
Out[ ]:
              Name Item Purchased  Cost
Store 2     Thomas   Kitty Litter   2.5
Store 3  Christina      Bird Seed   5.0