
Plotting with Pandas

The data we want to visualize often comes in the form of a pandas DataFrame. Let's look at how DataFrames and our plotting library interact. The interaction is rather easy: given any DataFrame df with some sample data, we can plot it by calling df.plot().

In [ ]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
In [ ]:
#create a sample dataframe df

np.random.seed(123)
df = pd.DataFrame({'A': np.random.randn(365).cumsum(0),'B': np.random.randn(365).cumsum(0) + 20,
                   'C': np.random.randn(365).cumsum(0) - 20}, index=pd.date_range('04/27/2017', periods=365))
#plot df
df.plot()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f12582f4b50>
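As the output above shows, df.plot() returns a matplotlib Axes object. The following is a minimal sketch of how that return value can be used to adjust the figure afterwards. The DataFrame name demo, the title and the axis label are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Illustrative sample data, built the same way as df above
np.random.seed(123)
demo = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 20},
                    index=pd.date_range('04/27/2017', periods=365))

# df.plot() hands back the matplotlib Axes it drew on
ax = demo.plot(title='Random walks')

# The Axes can then be customized with the usual matplotlib methods
ax.set_ylabel('value')
ax.legend(loc='upper left')
```

Keeping a reference to the Axes is optional, but it is the usual way to combine pandas plotting with matplotlib's own labelling and layout functions.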

Scatter Plots with Pandas

Within the df.plot() method we can furthermore use the kind parameter to specify which kind of plot we want to draw. Options include:

  • line plot (default)
  • 'bar' : vertical bar plot
  • 'barh' : horizontal bar plot
  • 'hist' : histogram
  • 'box' : boxplot
  • 'kde' : Kernel Density Estimation plot
  • 'density' : same as 'kde'
  • 'area' : area plot
  • 'pie' : pie plot
  • 'scatter' : scatter plot
  • 'hexbin' : hexagonal binning plot
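To make the list above concrete, here is a small sketch that selects a plot type in two ways: through the kind parameter and through the accessor method of the same name. The DataFrame demo and the values for bins and alpha are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sample data
np.random.seed(123)
demo = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 20})

# Passing kind= selects the plot type ...
ax_kind = demo.plot(kind='hist', bins=20, alpha=0.5)

# ... and the accessor method of the same name draws the same kind of plot
ax_method = demo.plot.hist(bins=20, alpha=0.5)
```

Both calls produce a histogram of the two columns; which spelling to use is largely a matter of taste.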

Let's try out the scatter plot:

In [ ]:
df.plot(x="A", y="B", kind="scatter")
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f12518f2150>

Because a scatter plot offers two axes plus the colour of each data point, it is especially useful for visualizing three-dimensional data in a flat, two-dimensional figure. Think of the iris dataset from the beginning:

In [ ]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()

df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
# drop() returns a new DataFrame, so assign the result to keep the change
df = df.drop(columns=["petal length (cm)", "petal width (cm)"])

x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.figure(figsize=(8, 6))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
cb = plt.colorbar(ticks=[0, 1, 2], format=formatter)
cb.ax.set_ylabel('Flower Species', rotation=270, labelpad = 15)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])

plt.tight_layout()
plt.show()

Another way to create a scatter plot from a DataFrame is the df.plot.scatter() method. Its use is quite similar to the above.

In [ ]:
np.random.seed(123)
df = pd.DataFrame({'Length': np.random.randn(365).cumsum(0),
                   'Width': np.random.randn(365).cumsum(0) + 20,
                   'Height': np.random.randn(365).cumsum(0) - 20},
                  index=pd.date_range('04/27/2020', periods=365))
# colour each point by its 'Width' value
ax = df.plot.scatter(x='Length', y='Height', c='Width', cmap='viridis')
ax.set_aspect("equal")

3D Scatterplot

If we have higher-dimensional data, it might be useful to plot a 3D graph. As this involves rather advanced Python, it is not going to be relevant for the exam. The following example should nonetheless give you an idea of how it works:

In [ ]:
from mpl_toolkits.mplot3d import Axes3D

# create a sample DataFrame with three dimensions
np.random.seed(123)
df = pd.DataFrame({'Length': np.random.randn(365).cumsum(0),
                   'Width': np.random.randn(365).cumsum(0) + 20,
                   'Height': np.random.randn(365).cumsum(0) - 20},
                  index=pd.date_range('04/27/2020', periods=365))

#create 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["Length"].values, df["Height"].values, df["Width"].values, c=df["Width"].values)
ax.set_xlabel('Length')
ax.set_ylabel('Height')
ax.set_zlabel('Width')
Out[ ]:
Text(0.5, 0, 'Width')

Note: pandas does not offer a 3D plot kind in df.plot(), which is why the example above calls matplotlib directly.

Box Plots with Pandas

Using the df.plot.box() method, one can furthermore create a box plot from a DataFrame:

In [ ]:
np.random.seed(123)
df = pd.DataFrame({'Length': np.random.randn(365).cumsum(0),
                   'Width': np.random.randn(365).cumsum(0) + 20,
                   'Height': np.random.randn(365).cumsum(0) - 20},
                  index=pd.date_range('04/27/2020', periods=365))

#create box plot
df.plot.box()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1238e5bd50>
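To read a box plot it helps to know which numbers it draws. With matplotlib's default settings, each box spans the first to the third quartile of a column and the line inside marks the median. The sketch below computes these values with DataFrame.quantile() next to the plot; the DataFrame name demo is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sample data, built the same way as df above
np.random.seed(123)
demo = pd.DataFrame({'Length': np.random.randn(365).cumsum(),
                     'Width': np.random.randn(365).cumsum() + 20,
                     'Height': np.random.randn(365).cumsum() - 20})

# Quartiles per column: the box spans the 25% to 75% rows,
# and the 50% row (the median) is the line inside the box
quartiles = demo.quantile([0.25, 0.5, 0.75])
print(quartiles)

# Draw the corresponding box plot
ax = demo.plot.box()
```

Comparing the printed table with the figure is a quick way to check that the boxes are read correctly.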