Data Wrangling with Pandas

In the previous chapter, we dove into detail on NumPy and its ndarray object, which provides efficient storage and manipulation of dense typed arrays in Python. Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Installing and Using Pandas

Installing Pandas on your system requires NumPy to be installed as well. Details on installation can be found in the Pandas documentation. Both packages can be installed with the pip command.
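Before installing anything, it can help to check what is already available. The following is a minimal sketch rather than an official procedure: the helper name `is_installed` is hypothetical, and it relies only on the standard library's `importlib.util.find_spec`, which looks up a top-level package without importing it.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located on this system.

    Hypothetical helper for illustration; find_spec returns None when
    no module with the given name can be found.
    """
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages this chapter depends on.
for package_name in ("numpy", "pandas"):
    status = "installed" if is_installed(package_name) else "missing"
    print(package_name, status)
```

If either package is reported as missing, installing Pandas with pip should bring in NumPy as a dependency.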

Once Pandas is installed, you can import it and check the version:

In [ ]:
import pandas
pandas.__version__
Out[ ]:
'1.1.5'

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:

In [ ]:
import pandas as pd

This import convention will be used throughout the remainder of this course.
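With the alias in place, the kind of object described at the start of the chapter can be sketched in a few lines. The example below is only an illustration, and its column names, row labels, and values are made up: it builds a small DataFrame with row and column labels, columns of different types, and one missing value.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data: three labeled rows, three typed columns.
df = pd.DataFrame(
    {
        "name": ["Ada", "Ben", "Cy"],    # text column
        "score": [9.5, np.nan, 7.0],     # float column with one missing value
        "passed": [True, False, True],   # boolean column
    },
    index=["a", "b", "c"],               # explicit row labels
)

print(df.shape)                  # (3, 3)
print(df.dtypes)                 # one dtype per column
print(df.loc["c", "name"])       # label-based lookup: Cy
print(df["score"].isna().sum())  # number of missing scores: 1
```

Each column keeps its own type, and values are looked up by label rather than by integer position.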

Reminder about Built-In Documentation

As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ? character).

For example, to display all the contents of the pandas namespace, you can type:

In [3]: pd.<TAB>

And to display Pandas's built-in documentation, you can use this:

In [ ]:
pd?
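The `?` syntax is specific to IPython. In a plain Python script, similar information can be approximated with the built-in `dir` function and an object's docstring. This is a rough sketch of that idea, not a replacement for the IPython tools; the variable names are chosen for illustration.

```python
import pandas as pd

# List the public names in the pandas namespace, similar in spirit
# to what tab completion on "pd." would show.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names))
print(public_names[:10])

# Show the first line of a docstring, a short stand-in for "pd.DataFrame?".
first_line = pd.DataFrame.__doc__.strip().splitlines()[0]
print(first_line)
```

For the full text, the built-in `help(pd.DataFrame)` prints the complete docstring in any Python session.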

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.