< Previous Page | Home Page | Next Page >

Operations on DataFrames¶

In many cases, one must do more than just accessing certain rows and columns of a dataframe. Let's look at some further applications of our pandas library.

Replacing NaN¶

Often datasets have empty fields due to lack of data or wrong data extraction. These fields are filled with None type values which Pandas automatically converts to a special floating point value designated as NAN, which stands for not a number.

If we create a list of numbers, integers or floats, and put in the None type, Pandas automatically converts this to a special floating point value designated as NAN, which stands for not a number.

import pandas as pd
import numpy as np

numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Before processing our data, we often want to "clean" it. Here, we can replace all NaN fields with a real number of our choice with numpy.fillna(number).

df = pd.DataFrame(data={'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'], '% Growth':[4, 2, 4, 8, np.nan]})

df = (df.fillna(0))
df

Dropping Data¶

It's easy to delete data in series and DataFrames, and we can use the drop function to do so. This function takes a single parameter, which is the index or roll label, to drop.

The basic syntax looks as follows:

df.drop(label, axis)

Axis can either be 0 (=drop rows) or 1 (= drop columns)

purchase_1 = pd.Series({'Name': 'Matthias',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame(data = [purchase_1, purchase_2, purchase_3], index=["Store 1", "Store 2", "Store 3"])
df # define our dataset from before

df.drop("Store 1")

df.drop("Cost", axis=1)

The drop function doesn't change the DataFrame by default. And instead, returns to you a copy of the DataFrame with the given rows removed.

We can see that our original DataFrame is still intact.

df

Let's make a copy with the copy method and do a drop on it instead.

copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

This is a very typical pattern in Pandas, where in-place changes to a DataFrame are only done if need be, usually on changes involving indices.

copy_df.drop? # let's take a further look at drop

Drop has two interesting optional parameters.

The first is called in place, and if it's set to true, the DataFrame will be updated in place, instead of a copy being returned.
The second parameter is the axes, which should be dropped.
- By default, this value is 0, indicating the row axes.
- But you could change it to 1 if you want to drop a column.

Ranking a DataFrame¶

You can rank a DataFrames according to specific values within a column. Let's explore the rank function.

df.rank?

We can see that rank has some interesting parameters:

Axis describes whether we rank rows (0) or columns(1)
The method parameter specifies rules that apply if certain elements have the same value. It defaults to "average" which means that it takes the average for records that have the same rank.
na_option specifies what to do with NaN values. Bottom for example, pushes these values to the end of the ranking.
with pct set to True, we can rank according to the percentile of the values.

# create example dataframe
foo = pd.DataFrame(data={'Company': ['Apple',
'Google', 'Intel', 'AMD', 'Startup'], '% Growth':
[4, 2, 4, 8, np.nan]})
foo

foo['default'] = foo['% Growth'].rank()
foo['max'] = foo['% Growth'].rank(method='max')
foo['NA_bottom'] = foo['% Growth'].rank(na_option='bottom')
foo['pct'] = foo['% Growth'].rank(pct=True)

foo

Sorting DataFrames¶

with pd.sort_values we can sort our dataframe according to specified parameters. Again, we must be careful to specify the correct axis!

df.sort_values(by=[str or list to sort by], axist)

foo.sort_values(by= "% Growth", ascending=False)

Note, that the index changes according to the rows.

Altering the Dataframe¶

Let's explore how we can make "on-the-go" changes to a DataFrame.

df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50}, 
                  {'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50}, 
                   {'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}], 
                  index=['Store 1', 'Store 1', 'Store 2'])
df

Adding Columns¶

You can add new columns. Just define the data you want to have in that column. Note that it must correspond to the number of rows in your DataFrame!

df['Date'] = ['December 1', 'January 1', 'mid-May']
df

Adding Single Values¶

You can also create a column and fill it with a single value. This is often used to create a placeholder, which will be altered later on.

df["Delivered"] = np.nan
df

# let's fill the column
df["Delivered"] = [True, True, False]
df

Add Series as a Column¶

Remember that we said earlier that, in essence, a Dataframe consists of several combined Series. Therefore, it should be possible to add an entire Series as a column to a Dataframe.

#create series object

s = pd.Series(["Positive", "Negative"])
s

0    Positive
1    Negative
dtype: object

df["Feedback"] = s
df

Note that the Feedback column did not take in our values. What happened?

Well, our Series s and the DataFrame df use different indexes ([0,1] and [Store 1, Store 2, Store 3] respectively).

One solution would be resetting the index of our DataFrame via reset_index. This stores our current index as a column and defaults to the standart index. Now the index of our df and s are the same and we can re-insert the values.

df = df.reset_index() #reset index
df

df["Feedback"] = s # re-insert values
df

Indexing DataFrames¶

The example above shows that it can be necessary to change or reset the indices of a DataFrame.

As we have seen, both Series and DataFrames can have indices applied to them. The index is essentially a row level label, and we know that rows correspond to axis zero.

Indices can either be inferred, such as when we create a new series without an index, in which case we get numeric values, or they can be set explicitly, like when we use an dictionary object to create the series.

Let's re-create our original purchase dataset, and define the indices explicitly.

df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50}, 
                  {'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50}, 
                   {'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}], 
                  index=['Store 1', 'Store 1', 'Store 2'])
df

Set an Index¶

One option for setting an index is to use the set_index() function. This function takes a list of columns and promotes those columns to an index. set_index() is a destructive process, it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy into it values from the index attribute.

Let's go back to our DataFrame. Let's say that we don't want to index the DataFrame by Stores, but instead want to index by the Names of the Customers. First we need to preserve the Store information into a new column. We can do this using the indexing operator or the string that has the column label. Then we can use set_index() to set the index of the column to the customer names.

df['Stores'] = df.index
df = df.set_index('Name') #this is a destructive oricess
df

You'll see that when we create a new index from an existing column it appears that a new first row has been added with empty values. This isn't quite what's happening. And we know this in part because an empty value is actually rendered either as a none or an NaN if the data type of the column is numeric. What's actually happened is that the index has a name.

Reset Index¶

We can get rid of the index completely by calling the function reset_index(). This promotes the index into a column and creates a default numbered index.

df = df.reset_index()
df

< Previous Page | Home Page | Next Page >

	Name	Item Purchased	Cost
Store 1	Matthias	Dog Food	22.5
Store 2	Thomas	Kitty Litter	2.5
Store 3	Christina	Bird Seed	5.0

	Company	% Growth	default	max	NA_bottom	pct
0	Apple	4.0	2.5	3.0	2.5	0.625
1	Google	2.0	1.0	1.0	1.0	0.250
2	Intel	4.0	2.5	3.0	2.5	0.625
3	AMD	8.0	4.0	4.0	4.0	1.000
4	Startup	NaN	NaN	NaN	5.0	NaN

	Name	Item Purchased	Cost
Store 1	Matthias	Sponge	22.5
Store 1	Thomas	Kitty Litter	2.5
Store 2	Christina	Spoon	5.0

	Item Purchased	Cost	Stores
Name
Matthias	Sponge	22.5	Store 1
Thomas	Kitty Litter	2.5	Store 1
Christina	Spoon	5.0	Store 2