Open In Colab

Operations on DataFrames

In many cases, one must do more than just accessing certain rows and columns of a dataframe. Let's look at some further applications of our pandas library.

Replacing NaN

Often datasets have empty fields due to lack of data or wrong data extraction. These fields are filled with None type values which Pandas automatically converts to a special floating point value designated as NAN, which stands for not a number.

If we create a list of numbers, integers or floats, and put in the None type, Pandas automatically converts this to a special floating point value designated as NAN, which stands for not a number.

In [ ]:
import pandas as pd
import numpy as np

numbers = [1, 2, None]
pd.Series(numbers)
Out[ ]:
0    1.0
1    2.0
2    NaN
dtype: float64

Before processing our data, we often want to "clean" it. Here, we can replace all NaN fields with a real number of our choice with numpy.fillna(number).

In [ ]:
df = pd.DataFrame(data={'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'], '% Growth':[4, 2, 4, 8, np.nan]})

df = (df.fillna(0))
df
Out[ ]:
Company % Growth
0 Apple 4.0
1 Google 2.0
2 Intel 4.0
3 AMD 8.0
4 Startup 0.0

Dropping Data

It's easy to delete data in series and DataFrames, and we can use the drop function to do so. This function takes a single parameter, which is the index or roll label, to drop.

The basic syntax looks as follows:

df.drop(label, axis)

Axis can either be 0 (=drop rows) or 1 (= drop columns)

In [ ]:
purchase_1 = pd.Series({'Name': 'Matthias',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame(data = [purchase_1, purchase_2, purchase_3], index=["Store 1", "Store 2", "Store 3"])
df # define our dataset from before
Out[ ]:
Name Item Purchased Cost
Store 1 Matthias Dog Food 22.5
Store 2 Thomas Kitty Litter 2.5
Store 3 Christina Bird Seed 5.0
In [ ]:
df.drop("Store 1")
Out[ ]:
Name Item Purchased Cost
Store 2 Thomas Kitty Litter 2.5
Store 3 Christina Bird Seed 5.0
In [ ]:
df.drop("Cost", axis=1)
Out[ ]:
Name Item Purchased
Store 1 Matthias Dog Food
Store 2 Thomas Kitty Litter
Store 3 Christina Bird Seed

The drop function doesn't change the DataFrame by default. And instead, returns to you a copy of the DataFrame with the given rows removed.

We can see that our original DataFrame is still intact.

In [ ]:
df
Out[ ]:
Name Item Purchased Cost
Store 1 Matthias Dog Food 22.5
Store 2 Thomas Kitty Litter 2.5
Store 3 Christina Bird Seed 5.0

Let's make a copy with the copy method and do a drop on it instead.

In [ ]:
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df
Out[ ]:
Name Item Purchased Cost
Store 2 Thomas Kitty Litter 2.5
Store 3 Christina Bird Seed 5.0

This is a very typical pattern in Pandas, where in-place changes to a DataFrame are only done if need be, usually on changes involving indices.

In [ ]:
copy_df.drop? # let's take a further look at drop

Drop has two interesting optional parameters.

  • The first is called in place, and if it's set to true, the DataFrame will be updated in place, instead of a copy being returned.

  • The second parameter is the axes, which should be dropped.

    • By default, this value is 0, indicating the row axes.

    • But you could change it to 1 if you want to drop a column.

Ranking a DataFrame

You can rank a DataFrames according to specific values within a column. Let's explore the rank function.

In [ ]:
df.rank?

We can see that rank has some interesting parameters:

  • Axis describes whether we rank rows (0) or columns(1)
  • The method parameter specifies rules that apply if certain elements have the same value. It defaults to "average" which means that it takes the average for records that have the same rank.
  • na_option specifies what to do with NaN values. Bottom for example, pushes these values to the end of the ranking.
  • with pct set to True, we can rank according to the percentile of the values.
In [ ]:
# create example dataframe
foo = pd.DataFrame(data={'Company': ['Apple',
'Google', 'Intel', 'AMD', 'Startup'], '% Growth':
[4, 2, 4, 8, np.nan]})
foo
Out[ ]:
Company % Growth
0 Apple 4.0
1 Google 2.0
2 Intel 4.0
3 AMD 8.0
4 Startup NaN
In [ ]:
foo['default'] = foo['% Growth'].rank()
foo['max'] = foo['% Growth'].rank(method='max')
foo['NA_bottom'] = foo['% Growth'].rank(na_option='bottom')
foo['pct'] = foo['% Growth'].rank(pct=True)

foo
Out[ ]:
Company % Growth default max NA_bottom pct
0 Apple 4.0 2.5 3.0 2.5 0.625
1 Google 2.0 1.0 1.0 1.0 0.250
2 Intel 4.0 2.5 3.0 2.5 0.625
3 AMD 8.0 4.0 4.0 4.0 1.000
4 Startup NaN NaN NaN 5.0 NaN

Sorting DataFrames

with pd.sort_values we can sort our dataframe according to specified parameters. Again, we must be careful to specify the correct axis!

df.sort_values(by=[str or list to sort by], axist)

In [ ]:
foo.sort_values(by= "% Growth", ascending=False)
Out[ ]:
Company % Growth default max NA_bottom pct
3 AMD 8.0 4.0 4.0 4.0 1.000
0 Apple 4.0 2.5 3.0 2.5 0.625
2 Intel 4.0 2.5 3.0 2.5 0.625
1 Google 2.0 1.0 1.0 1.0 0.250
4 Startup NaN NaN NaN 5.0 NaN

Note, that the index changes according to the rows.

Altering the Dataframe

Let's explore how we can make "on-the-go" changes to a DataFrame.

In [ ]:
df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50}, 
                  {'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50}, 
                   {'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}], 
                  index=['Store 1', 'Store 1', 'Store 2'])
df
Out[ ]:
Name Item Purchased Cost
Store 1 Matthias Sponge 22.5
Store 1 Thomas Kitty Litter 2.5
Store 2 Christina Spoon 5.0

Adding Columns

You can add new columns. Just define the data you want to have in that column. Note that it must correspond to the number of rows in your DataFrame!

In [ ]:
df['Date'] = ['December 1', 'January 1', 'mid-May']
df
Out[ ]:
Name Item Purchased Cost Date
Store 1 Matthias Sponge 22.5 December 1
Store 1 Thomas Kitty Litter 2.5 January 1
Store 2 Christina Spoon 5.0 mid-May

Adding Single Values

You can also create a column and fill it with a single value. This is often used to create a placeholder, which will be altered later on.

In [ ]:
df["Delivered"] = np.nan
df
Out[ ]:
Name Item Purchased Cost Date Delivered
Store 1 Matthias Sponge 22.5 December 1 NaN
Store 1 Thomas Kitty Litter 2.5 January 1 NaN
Store 2 Christina Spoon 5.0 mid-May NaN
In [ ]:
# let's fill the column
df["Delivered"] = [True, True, False]
df
Out[ ]:
Name Item Purchased Cost Date Delivered
Store 1 Matthias Sponge 22.5 December 1 True
Store 1 Thomas Kitty Litter 2.5 January 1 True
Store 2 Christina Spoon 5.0 mid-May False

Add Series as a Column

Remember that we said earlier that, in essence, a Dataframe consists of several combined Series. Therefore, it should be possible to add an entire Series as a column to a Dataframe.

In [ ]:
#create series object

s = pd.Series(["Positive", "Negative"])
s
Out[ ]:
0    Positive
1    Negative
dtype: object
In [ ]:
df["Feedback"] = s
df
Out[ ]:
Name Item Purchased Cost Date Delivered Feedback
Store 1 Matthias Sponge 22.5 December 1 True NaN
Store 1 Thomas Kitty Litter 2.5 January 1 True NaN
Store 2 Christina Spoon 5.0 mid-May False NaN

Note that the Feedback column did not take in our values. What happened?

Well, our Series s and the DataFrame df use different indexes ([0,1] and [Store 1, Store 2, Store 3] respectively).

One solution would be resetting the index of our DataFrame via reset_index. This stores our current index as a column and defaults to the standart index. Now the index of our df and s are the same and we can re-insert the values.

In [ ]:
df = df.reset_index() #reset index
df
Out[ ]:
index Name Item Purchased Cost Date Delivered Feedback
0 Store 1 Matthias Sponge 22.5 December 1 True NaN
1 Store 1 Thomas Kitty Litter 2.5 January 1 True NaN
2 Store 2 Christina Spoon 5.0 mid-May False NaN
In [ ]:
df["Feedback"] = s # re-insert values
df
Out[ ]:
index Name Item Purchased Cost Date Delivered Feedback
0 Store 1 Matthias Sponge 22.5 December 1 True Positive
1 Store 1 Thomas Kitty Litter 2.5 January 1 True Negative
2 Store 2 Christina Spoon 5.0 mid-May False NaN

Indexing DataFrames

The example above shows that it can be necessary to change or reset the indices of a DataFrame.

As we have seen, both Series and DataFrames can have indices applied to them. The index is essentially a row level label, and we know that rows correspond to axis zero.

Indices can either be inferred, such as when we create a new series without an index, in which case we get numeric values, or they can be set explicitly, like when we use an dictionary object to create the series.

Let's re-create our original purchase dataset, and define the indices explicitly.

In [ ]:
df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50}, 
                  {'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50}, 
                   {'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}], 
                  index=['Store 1', 'Store 1', 'Store 2'])
df
Out[ ]:
Name Item Purchased Cost
Store 1 Matthias Sponge 22.5
Store 1 Thomas Kitty Litter 2.5
Store 2 Christina Spoon 5.0

Set an Index

One option for setting an index is to use the set_index() function. This function takes a list of columns and promotes those columns to an index. set_index() is a destructive process, it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy into it values from the index attribute.

Let's go back to our DataFrame. Let's say that we don't want to index the DataFrame by Stores, but instead want to index by the Names of the Customers. First we need to preserve the Store information into a new column. We can do this using the indexing operator or the string that has the column label. Then we can use set_index() to set the index of the column to the customer names.

In [ ]:

In [ ]:
df['Stores'] = df.index
df = df.set_index('Name') #this is a destructive oricess
df
Out[ ]:
Item Purchased Cost Stores
Name
Matthias Sponge 22.5 Store 1
Thomas Kitty Litter 2.5 Store 1
Christina Spoon 5.0 Store 2

You'll see that when we create a new index from an existing column it appears that a new first row has been added with empty values. This isn't quite what's happening. And we know this in part because an empty value is actually rendered either as a none or an NaN if the data type of the column is numeric. What's actually happened is that the index has a name.

Reset Index

We can get rid of the index completely by calling the function reset_index(). This promotes the index into a column and creates a default numbered index.

In [ ]:
df = df.reset_index()
df
Out[ ]:
Name Item Purchased Cost Stores
0 Matthias Sponge 22.5 Store 1
1 Thomas Kitty Litter 2.5 Store 1
2 Christina Spoon 5.0 Store 2