In the following we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
data['b']
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
'a' in data
data.keys()
list(data.items())
Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays. Examples of these are as follows:
# slicing by explicit index
data['a':'c']
# slicing by implicit integer index
data[0:2]
Here, slicing may be a source of confusion.
Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
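To make the distinction concrete, here is a short sketch (reusing the data Series defined above) showing that loc always refers to the explicit index labels, while iloc always refers to the implicit integer positions:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# loc slices by explicit label: the endpoint 'c' is included
print(data.loc['a':'c'])   # entries a, b, c

# iloc slices by implicit integer position: the endpoint 2 is excluded
print(data.iloc[0:2])      # entries a, b
```

Using these attributes consistently avoids having to remember which slicing convention the plain indexing operator will pick.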
A Pandas Series can be queried either by the index position or by the index label. As we saw, if you don't give an index to the series, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, you can use the loc attribute.
sports = {'Football': 'Germany',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Access an entry by integer position (implicit index):
s.iloc[3]
Access an entry by label (explicit index):
s.loc['Golf']
Keep in mind that iloc and loc are not methods, they are attributes. So you don't use parentheses to query them, but square brackets instead, which we'll call the indexing operator.
If you pass in an integer parameter, the operator will behave as if you want to query via the iloc attribute. (Note that this positional fallback has been deprecated in recent versions of pandas; using iloc explicitly is preferred.)
s[3]
If you pass in an object, it will query as if you wanted to use the label-based loc attribute.
s['Golf']
Here's an example using some new sports data, where countries are indexed by integer.
sports = {99: 'Germany',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s
If we try to call s[0], we get a KeyError:
s[0] # we get a key error, because there's no item in the sports list with an index of zero.
s.iloc[0] # Instead we have to call iloc explicitly if we want the first item.
s[99] # the first entry has the index of 99
One last note on using the indexing operators to access series data: the .loc attribute lets you not only modify data in place, but also add new data. Pandas will automatically change the underlying NumPy types as appropriate.
s = pd.Series([1, 2, 3])
print(s)
s.loc['Animal'] = 'Dog'
print(s)
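A short sketch of the type change mentioned above: the Series starts out as a homogeneous integer array, and assigning a string through .loc silently upcasts the underlying dtype to object:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)            # int64: a homogeneous NumPy integer array

s.loc['Animal'] = 'Dog'   # adding a string forces a more general dtype
print(s.dtype)            # object
```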
Let's also look at an example where index values are not unique.
If we have a look at countries and their national sports, it turns out that many countries seem to like the game cricket. Let's define the following Series.
national_sports = pd.Series(['Australia',
                             'Barbados',
                             'Pakistan',
                             'England',
                             'Germany'],
                            index=['Cricket',
                                   'Cricket',
                                   'Cricket',
                                   'Cricket',
                                   'Football'])
Printing the Series returns
national_sports
Finally, we see that when we query the series for those who have cricket as their national sport, we don't get a single value, but a series itself.
national_sports.loc['Cricket']
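The difference in return types is worth keeping in mind whenever an index may contain duplicates: a label that occurs only once comes back as a scalar, while a repeated label comes back as a Series. A quick sketch:

```python
import pandas as pd

national_sports = pd.Series(['Australia', 'Barbados', 'Pakistan',
                             'England', 'Germany'],
                            index=['Cricket', 'Cricket', 'Cricket',
                                   'Cricket', 'Football'])

print(type(national_sports.loc['Football']))  # a single string value
print(type(national_sports.loc['Cricket']))   # a Series of four values
```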
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.
Let's return to our example of areas and populations of states:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']
Equivalently, we can use attribute-style access with column names that are strings:
data.area
Though this is a useful shorthand, keep in mind that it does not work for all cases! For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:
data.pop is data['pop']
In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
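For instance, we can construct a new column from the existing ones with a single dictionary-style assignment (a sketch using the area/pop DataFrame defined above; the 'density' column name is just illustrative):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# dictionary-style assignment creates a new column via element-wise division
data['density'] = data['pop'] / data['area']
print(data)
```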
Similar to the Series, we can extract data using the iloc and loc attributes. This is demonstrated below.
Let's create the following DataFrame of purchases done in different Stores by different people.
purchase_1 = pd.Series({'Name': 'Matthias',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame(data = [purchase_1, purchase_2, purchase_3], index=["Store 1", "Store 2", "Store 3"])
df
Let's select data associated with Store 1. Again rows can be accessed by their row index using .iloc.
Be aware that index 0 indicates the first element (zero-based indexing).
df.iloc[0]
Because the DataFrame is two-dimensional, passing a single value to the iloc indexing operator returns a Series containing the data of that row.
Of course we can also use the .loc attribute to access rows by the name of the label.
df.loc['Store 1']
We can check the data type of the return value using the Python type function.
type(df.loc['Store 2'])
We can also access multiple rows. Here we just pass a list of the desired rows to the .loc or .iloc operator.
df.loc[["Store 1", "Store 2"]]
df.iloc[[0,1]]
What if we just wanted to access specific columns instead? Let's say we want to get a list of all the costs.
1. Indexing Operator
In a Pandas DataFrame, columns always have a name; therefore, we can simply use indexing to access a column:
df['Cost']
2. Full slice with loc
Both .loc and .iloc support row selection and can take two parameters: the row index and a list of column names. They also support slicing.
If we want to select all rows, we can use the colon operator : to indicate a full slice from beginning to end. To select a specific column, we then add the column name as the second parameter.
df.loc[:,["Cost"]]
In fact, if we wanted to include multiple columns, we could do so in a list.
a = df.loc[:,['Name', 'Cost']]
a
This time around, however, we do not get a series object as a return but another DataFrame.
type(a)
As you may have guessed, .iloc and .loc let you access rows and columns at the same time. Simply follow the syntax:
df.iloc[[list of rows], [list of columns]]
Make sure not to mix iloc and loc: labels belong to .loc and integer positions to .iloc.
df.loc[["Store 1", "Store 3"], ["Name"]]
df.iloc[[0,2],[0]]
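A quick sketch of what happens if the two are mixed up: passing string labels to .iloc, which expects integer positions, raises an error rather than silently guessing (the small DataFrame below mirrors the purchases example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 2', 'Store 3'])

try:
    df.iloc[['Store 1'], [0]]    # labels where positions are expected
except Exception as exc:
    # pandas rejects non-integer indexers passed to iloc
    print(type(exc).__name__)
```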
Boolean masking is the heart of fast and efficient querying in NumPy.
A Boolean mask is an array which can be one-dimensional like a Series, or two-dimensional like a DataFrame, where each of the values in the array is either True or False. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a True value is admitted into our final result, and any cell aligned with a False value is not.
Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects.
For instance, in our Store data set, you might be interested in seeing only those customers who have bought a product with a cost lower than 10. To build a Boolean mask for this query, we project the Cost column using the indexing operator and apply the less-than operator with a comparison value of 10. This essentially broadcasts the comparison, with the results being returned as a Boolean series.
df['Cost'] < 10
The resulting series is indexed identically to the original, and the value of each cell is either True or False depending on whether the purchase cost less than 10.
So this builds us the Boolean mask, which is half the battle. What we want to do next is overlay that mask on the data frame.
We can do this using the where function. The where function takes a Boolean mask as a condition, applies it to the data frame or series, and returns a new data frame or series of the same shape. Let's apply this Boolean mask to our data and create a data frame of only those purchases with a cost lower than 10.
lower_ten = df.where(df['Cost'] < 10)
lower_ten
We see that the resulting data frame keeps the original index values, and only data from rows that met the condition are retained. All of the rows which did not meet the condition have NaN data instead. This is okay: most statistical functions built into the data frame object ignore NaN values.
For instance, if we call count() on the Cost column of the lower_ten data frame:
lower_ten['Cost'].count()
we see that there are 2 purchases which were below ten cost units.
If we call count on the original data frame, we see that there are 3 purchases total.
df['Cost'].count()
We do not actually have to use the where function explicitly. The pandas developers allow the indexing operator to take a Boolean mask as a value instead of just a list of column names.
The syntax looks as follows:
df[df['<column>'] <boolean condition>]
Here's a more concise example of how we could query this data frame. You'll notice that there are no NaNs when you query the data frame in this manner: pandas automatically filters out the rows where the mask is False.
lower_ten = df[df['Cost'] < 10]
lower_ten
The syntax might look a little messy, especially if you're not used to programming languages with overloaded operators, but the result is that you're able to filter and reduce data frames relatively quickly.
One more thing to keep in mind if you're not used to Boolean or bit masking for data reduction. The output of two Boolean masks being compared with logical operators is another Boolean mask. This means that you can chain together a bunch of and/or statements in order to create more complex queries, and the result is a single Boolean mask.
For instance, we could create a mask for all of those purchases that cost less than 10 cost units and were made by Thomas.
cheap_thomas = df[(df['Cost'] < 10) & (df['Name'] == "Thomas")]
cheap_thomas
Extremely important, and often an issue for new users, is to remember that each Boolean mask needs to be encased in parenthesis because of the order of operations. This can cause no end of frustration if you're not used to it, so be careful.
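A sketch of why the parentheses matter: Python's & operator binds more tightly than comparison operators, so without parentheses the expression is grouped in a way that raises an error instead of building the intended mask:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                   'Cost': [22.50, 2.50, 5.00]})

try:
    # parsed as df['Cost'] < (10 & df['Name']) == "Thomas" -- not what we meant
    mask = df['Cost'] < 10 & df['Name'] == "Thomas"
except Exception as exc:
    print(type(exc).__name__)

# with parentheses, each comparison produces its Boolean mask first
mask = (df['Cost'] < 10) & (df['Name'] == "Thomas")
print(mask.tolist())   # [False, True, False]
```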
There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:
df['Store 2':'Store 3']
Such slices can also refer to rows by number rather than by index:
df[1:3]
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
df[df["Cost"] < 10]