In many cases, one must do more than just access certain rows and columns of a DataFrame. Let's look at some further applications of the Pandas library.
Datasets often have empty fields, either because data is missing or because it was extracted incorrectly. In Python, such fields are usually represented by None.
If we create a list of numbers (integers or floats) and include None, Pandas automatically converts the None to a special floating-point value designated NaN, which stands for "not a number".
import pandas as pd
import numpy as np
numbers = [1, 2, None]
pd.Series(numbers)
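As a small sketch of the conversion described above, the following checks the resulting data type and locates the missing entry with isna(). The variable names are chosen for illustration only.

```python
import pandas as pd

numbers = pd.Series([1, 2, None])

# The integers are upcast to floats so that NaN can be stored alongside them
print(numbers.dtype)

# isna() returns a boolean mask that is True where a value is missing
print(numbers.isna())
missing_count = int(numbers.isna().sum())
print(missing_count)
```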
Before processing our data, we often want to "clean" it. Here, we can replace all NaN fields with a value of our choice using the DataFrame method fillna(value).
df = pd.DataFrame(data={'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'], '% Growth':[4, 2, 4, 8, np.nan]})
df = df.fillna(0)  # replace every NaN with 0
df
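Zero is not always a sensible replacement. As a hedged sketch, the example below fills the gap with the column mean instead, passing a dict to fillna so that only the named column is touched. Using the mean is an assumption made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

growth = pd.DataFrame({'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'],
                       '% Growth': [4, 2, 4, 8, np.nan]})

# fillna returns a new DataFrame; `growth` itself still contains the NaN
filled_zero = growth.fillna(0)

# mean() skips NaN by default: (4 + 2 + 4 + 8) / 4
col_mean = growth['% Growth'].mean()

# A dict restricts the fill to the named column
filled_mean = growth.fillna({'% Growth': col_mean})
print(filled_mean)
```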
It's easy to delete data from Series and DataFrames with the drop function. Its main parameter is the index (row) label or column label to drop.
The basic syntax looks as follows:
df.drop(label, axis)
axis can be either 0 (drop rows) or 1 (drop columns).
purchase_1 = pd.Series({'Name': 'Matthias',
'Item Purchased': 'Dog Food',
'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
'Item Purchased': 'Kitty Litter',
'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
'Item Purchased': 'Bird Seed',
'Cost': 5.00})
df = pd.DataFrame(data = [purchase_1, purchase_2, purchase_3], index=["Store 1", "Store 2", "Store 3"])
df # define our dataset from before
df.drop("Store 1")
df.drop("Cost", axis=1)
By default, the drop function doesn't change the DataFrame. Instead, it returns a copy of the DataFrame with the given rows or columns removed.
We can see that our original DataFrame is still intact.
df
Let's make a copy with the copy method and do a drop on it instead.
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df
This is a very typical pattern in Pandas, where in-place changes to a DataFrame are made only when needed, usually for changes involving indices.
copy_df.drop? # let's take a further look at drop
drop has two interesting optional parameters.
The first is called inplace. If it is set to True, the DataFrame is updated in place instead of a copy being returned.
The second parameter is axis, which selects the axis to drop from.
By default, this value is 0, indicating the row axis.
You can change it to 1 if you want to drop a column.
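The two parameters can be sketched as follows. The DataFrame `shops` and the other variable names are made up for this illustration.

```python
import pandas as pd

shops = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                      'Cost': [22.50, 2.50, 5.00]},
                     index=['Store 1', 'Store 2', 'Store 3'])

# axis=0 is the default: drop a row by its index label
without_store_1 = shops.drop('Store 1')

# axis=1: drop a column by its label
without_cost = shops.drop('Cost', axis=1)

# Neither call above modified `shops`
print(shops.shape)

# With inplace=True the object itself is modified and the call returns None
scratch = shops.copy()
result = scratch.drop('Store 1', inplace=True)
print(result)
print(scratch)
```

Because the in-place call returns None, writing `scratch = scratch.drop(..., inplace=True)` would overwrite the DataFrame with None, which is an easy mistake to make.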
You can rank the rows of a DataFrame according to the values within a column. Let's explore the rank function.
df.rank?
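We can see that rank has some interesting parameters, such as method (how ties are ranked), na_option (how NaN values are ranked), and pct (whether to return ranks as fractions):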
# create example dataframe
foo = pd.DataFrame(data={'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'],
                         '% Growth': [4, 2, 4, 8, np.nan]})
foo
foo['default'] = foo['% Growth'].rank()
foo['max'] = foo['% Growth'].rank(method='max')
foo['NA_bottom'] = foo['% Growth'].rank(na_option='bottom')
foo['pct'] = foo['% Growth'].rank(pct=True)
foo
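To make the effect of each parameter easier to see, here is a minimal sketch on a plain Series holding the same growth values. The two 4s are tied, which is where the method parameter matters.

```python
import numpy as np
import pandas as pd

growth = pd.Series([4, 2, 4, 8, np.nan])

# Default method='average': the tied 4s occupy ranks 2 and 3, so both get 2.5
default_rank = growth.rank()

# method='max': both tied values get the highest rank of the tie, 3
max_rank = growth.rank(method='max')

# na_option='bottom': NaN is ranked last instead of staying NaN
bottom_rank = growth.rank(na_option='bottom')

# pct=True: ranks are divided by the number of valid (non-NaN) values
pct_rank = growth.rank(pct=True)

print(pd.DataFrame({'value': growth, 'default': default_rank, 'max': max_rank,
                    'NA_bottom': bottom_rank, 'pct': pct_rank}))
```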
With the DataFrame method sort_values we can sort our DataFrame by the values in one or more columns. Again, we must be careful to specify the correct axis!
df.sort_values(by=<column label or list of labels>, axis=0)
foo.sort_values(by="% Growth", ascending=False)
Note that the index labels stay attached to their rows, so the index is no longer in ascending order after sorting.
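A short sketch of sorting by more than one column, assuming we want ties in growth to be broken alphabetically by company. It also shows one way to renumber the rows afterwards.

```python
import numpy as np
import pandas as pd

growth = pd.DataFrame({'Company': ['Apple', 'Google', 'Intel', 'AMD', 'Startup'],
                       '% Growth': [4, 2, 4, 8, np.nan]})

# Sort by growth (descending), then by company name (ascending) to break ties
ranked = growth.sort_values(by=['% Growth', 'Company'], ascending=[False, True])

# The original labels travel with their rows; NaN rows are placed last by default
original_labels = ranked.index.tolist()
print(original_labels)

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = ranked.reset_index(drop=True)
print(renumbered)
```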
Let's explore how we can make "on-the-go" changes to a DataFrame.
df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50},
{'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
{'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
index=['Store 1', 'Store 1', 'Store 2'])
df
You can add new columns. Just define the data you want to have in that column. Note that the number of values must match the number of rows in your DataFrame!
df['Date'] = ['December 1', 'January 1', 'mid-May']
df
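The length requirement can be checked with a small sketch. The `orders` DataFrame and the `Rating` column are hypothetical names used only here.

```python
import pandas as pd

orders = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                       'Cost': [22.50, 2.50, 5.00]})

# Three values for three rows: works
orders['Date'] = ['December 1', 'January 1', 'mid-May']

# Two values for three rows: Pandas raises a ValueError
try:
    orders['Rating'] = [5, 3]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(orders.columns.tolist())
```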
You can also create a column and fill it with a single value. This is often used to create a placeholder, which will be altered later on.
df["Delivered"] = np.nan
df
# let's fill the column
df["Delivered"] = [True, True, False]
df
Remember that we said earlier that, in essence, a DataFrame consists of several combined Series. Therefore, it should be possible to add an entire Series as a column to a DataFrame.
#create series object
s = pd.Series(["Positive", "Negative"])
s
df["Feedback"] = s
df
Note that the Feedback column did not take in our values; it contains only NaN. What happened?
Well, our Series s and the DataFrame df use different indexes ([0, 1] and ['Store 1', 'Store 1', 'Store 2'] respectively), and Pandas matches the values by index label, not by position.
One solution is to reset the index of our DataFrame via reset_index. This stores our current index as a column and replaces it with the standard integer index. Now the labels of s also appear in the index of df, so we can re-insert the values. The third row has no counterpart in s and remains NaN.
df = df.reset_index() #reset index
df
df["Feedback"] = s # re-insert values
df
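Resetting the index is not the only option. As a sketch of an alternative, the Series can be given labels that match the DataFrame's index. The example uses a small stand-in DataFrame with unique store labels, which is an assumption made to keep the illustration simple.

```python
import pandas as pd

orders = pd.DataFrame({'Cost': [22.50, 2.50, 5.00]},
                      index=['Store 1', 'Store 2', 'Store 3'])

# Default integer index (0, 1): no label matches, so every row becomes NaN
unlabeled = pd.Series(['Positive', 'Negative'])
orders['Feedback'] = unlabeled
all_missing = bool(orders['Feedback'].isna().all())
print(all_missing)

# Same values, but labeled with the stores they belong to
labeled = pd.Series(['Positive', 'Negative'], index=['Store 1', 'Store 2'])
orders['Feedback'] = labeled
print(orders)
```

Rows without a matching label, here Store 3, are left as NaN.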
The example above shows that it can be necessary to change or reset the indices of a DataFrame.
As we have seen, both Series and DataFrames can have indices applied to them. The index is essentially a row-level label, and we know that rows correspond to axis zero.
Indices can either be inferred, such as when we create a new Series without an index, in which case we get numeric values, or they can be set explicitly, like when we use a dictionary object to create the Series.
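The difference between an inferred and an explicit index can be seen in a minimal sketch:

```python
import pandas as pd

# No index given: Pandas infers a numeric index starting at 0
inferred = pd.Series(['Dog Food', 'Kitty Litter', 'Bird Seed'])
print(inferred.index.tolist())   # [0, 1, 2]

# Built from a dictionary: the keys become the index labels
explicit = pd.Series({'Store 1': 'Dog Food', 'Store 2': 'Kitty Litter'})
print(explicit.index.tolist())   # ['Store 1', 'Store 2']
```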
Let's re-create our original purchase dataset, and define the indices explicitly.
df = pd.DataFrame([{'Name': 'Matthias', 'Item Purchased': 'Sponge', 'Cost': 22.50},
{'Name': 'Thomas', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
{'Name': 'Christina', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
index=['Store 1', 'Store 1', 'Store 2'])
df
One option for setting an index is to use the set_index() function. This function takes a column label or a list of column labels and promotes those columns to the index. set_index() is a destructive process: it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy the values from the index attribute into it.
Let's go back to our DataFrame. Let's say that we don't want to index the DataFrame by stores, but instead by the names of the customers. First we need to preserve the store information in a new column. We can do this using the indexing operator with a string that holds the new column label. Then we can use set_index() to set the index of the DataFrame to the customer names.
df['Stores'] = df.index
df = df.set_index('Name')  # this is a destructive process
df
You'll see that when we create a new index from an existing column, it appears that a new first row has been added with empty values. This isn't quite what's happening. We know this in part because an empty value is actually rendered either as None or, if the data type of the column is numeric, as NaN. What has actually happened is that the index now has a name (Name), which is displayed in an extra header line.
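One way to confirm this is to inspect the index name and the shape directly. The following is a sketch on a stand-in DataFrame called `orders`; clearing the name on a copy is shown only as an option for when the extra header line is unwanted.

```python
import pandas as pd

orders = pd.DataFrame({'Name': ['Matthias', 'Thomas', 'Christina'],
                       'Cost': [22.50, 2.50, 5.00]},
                      index=['Store 1', 'Store 1', 'Store 2'])

orders['Stores'] = orders.index        # preserve the old index as a column
by_name = orders.set_index('Name')

print(by_name.index.name)              # Name
print(by_name.shape)                   # (3, 2): still three rows, none was added

# Setting the name to None removes the label from the display
unnamed = by_name.copy()
unnamed.index.name = None
print(unnamed)
```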
We can turn the index back into an ordinary column by calling the function reset_index(). This promotes the index into a column and creates a default numbered index.
df = df.reset_index()
df
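If the old index is not needed at all, reset_index also accepts drop=True, which discards it instead of turning it into a column. A minimal sketch, using a small stand-in DataFrame:

```python
import pandas as pd

by_name = pd.DataFrame({'Cost': [22.50, 2.50, 5.00]},
                       index=pd.Index(['Matthias', 'Thomas', 'Christina'], name='Name'))

# Default: the index becomes a regular column named after the index
restored = by_name.reset_index()
print(restored.columns.tolist())

# drop=True: the old index is thrown away
discarded = by_name.reset_index(drop=True)
print(discarded.columns.tolist())
```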