
Further Plots: Histogram, Boxplot & Heatmap

Histograms

A Matplotlib histogram visualizes the frequency distribution of a numeric array by splitting its value range into small, equal-sized bins and counting how many values fall into each one.

The pyplot.hist() function in matplotlib draws the histogram. It takes an array as its only required input, and you can optionally specify the number of bins. Each bin is drawn as one bar of the histogram, and the height of the bar shows how many values fall into that bin.

In [ ]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
In [ ]:
# create some normally distributed input data x of size 1000
x = np.random.normal(size = 1000)

# plot the histogram of x using 50 bins (bars)
plt.hist(x, bins=50)
plt.gca().set(title='Frequency Histogram', ylabel='Frequency')
Out[ ]:
[Text(0, 0.5, 'Frequency'), Text(0.5, 1.0, 'Frequency Histogram')]
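
To make the idea of bins more concrete, the following sketch computes the same bin counts without drawing anything. It uses np.histogram(), the NumPy function that performs the binning. The fixed random seed and the variable names counts and edges are assumptions added here for illustration and are not part of the example above.

```python
import numpy as np

# Assumption: a fixed seed so that the numbers are reproducible
rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# counts[i] is the number of values between edges[i] and edges[i+1]
counts, edges = np.histogram(x, bins=50)

print(len(counts))   # number of bins
print(len(edges))    # one more edge than there are bins
print(counts.sum())  # every value lands in exactly one bin

# all bins have the same width
print(np.allclose(np.diff(edges), edges[1] - edges[0]))
```

The heights of the bars drawn by plt.hist(x, bins=50) correspond to the entries of counts, and the bar boundaries correspond to edges.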

Look at the following example and try to understand every single line of code (take your time).

In [ ]:
#create 2x2 subplot with shared x-axis
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=True)
axs = [ax1,ax2,ax3,ax4]

# draw n random samples and plot the histograms
for n in range(0,len(axs)):
  sample_size = 10**(n+1)
  sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
  axs[n].hist(sample, bins=100)
  axs[n].set_title('n={}'.format(sample_size))
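
One way to read the four panels: the larger the sample, the closer the histogram gets to the bell shape of the normal distribution. The sketch below is an assumption-laden illustration of the same loop that prints the sample mean and standard deviation instead of plotting. The fixed seed and the summary list are hypothetical additions.

```python
import numpy as np

# Assumption: a fixed seed so that the numbers are reproducible
rng = np.random.default_rng(0)

summary = []
for n in range(4):
    sample_size = 10**(n + 1)  # 10, 100, 1000, 10000
    sample = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    # store size, sample mean and sample standard deviation
    summary.append((sample_size, sample.mean(), sample.std()))

for size, mean, std in summary:
    print('n={}: mean={:.3f}, std={:.3f}'.format(size, mean, std))
```

For large n, the printed values should lie close to loc=0.0 and scale=1.0, the parameters the samples were drawn with.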

Boxplot

We can also create more advanced types of plots - such as the boxplot. As this plot is rather advanced, it is not likely to be a topic in the exam. If you are nonetheless interested in how it works, feel free to scroll through this tutorial.

In [ ]:
def annotate_boxplot(bpdict, annotate_params=None, x_offset=0.05, x_loc=0, text_offset_x=35, text_offset_y=20):
  # use the default arrow annotations only if the caller did not pass any
  if annotate_params is None:
    annotate_params = dict(xytext=(text_offset_x, text_offset_y), textcoords='offset points', arrowprops={'arrowstyle':'->'})
  plt.annotate('Median', (x_loc + 1 + x_offset, bpdict['medians'][x_loc].get_ydata()[0]),
               **annotate_params)
  plt.annotate('25%', (x_loc + 1 + x_offset, bpdict['boxes'][x_loc].get_ydata()[0]), **annotate_params)
  plt.annotate('75%', (x_loc + 1 + x_offset, bpdict['boxes'][x_loc].get_ydata()[2]), **annotate_params)
  plt.annotate('5%', (x_loc + 1 + x_offset, bpdict['caps'][x_loc*2].get_ydata()[0]), **annotate_params)
  plt.annotate('95%', (x_loc + 1 + x_offset, bpdict['caps'][(x_loc*2)+1].get_ydata()[0]), **annotate_params)
In [ ]:
import pandas as pd
df = pd.DataFrame({'col1': np.random.normal(size=100), 'col2': np.random.normal(scale=2, size=100)})

bpdict = df.boxplot(whis=[5, 95], return_type='dict')
annotate_boxplot(bpdict, x_loc=1)
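
The five annotations correspond to percentiles of the data. As a rough sketch, they can be computed directly with np.percentile(). The fixed seed and the names labels and values are assumptions for illustration. Note that the whisker caps drawn by the boxplot end at actual data points, so they may differ slightly from the interpolated 5% and 95% values printed here.

```python
import numpy as np

# Assumption: a fixed seed so that the numbers are reproducible
rng = np.random.default_rng(0)
col2 = rng.normal(scale=2, size=100)

# the percentiles that the annotations in the boxplot refer to
labels = ['5%', '25%', 'Median', '75%', '95%']
values = np.percentile(col2, [5, 25, 50, 75, 95])

for label, value in zip(labels, values):
    print('{:>6}: {:.2f}'.format(label, value))
```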

Heatmap

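A heatmap is another great way to show the distribution of data. It is created with the plt.hist2d() function, which takes two arrays of equal length (the x and y coordinates of the data points), splits the plane into rectangular bins and shows the number of data points in each bin using colour. Which colours stand for high counts depends on the colour map. With the "cool" colour map used below, bins with few data points are drawn in cyan and bins with many data points in magenta.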

In [ ]:
# draw 10000 normally distributed random values (mean 0, standard deviation 1)
Y = np.random.normal(loc=0.0, scale=1.0, size=10000)

# draw 10000 uniformly distributed random values between 0 and 1
X = np.random.random(size=10000)


# use the cmap parameter to change the colour scheme; possible colour schemes can be found online
plt.hist2d(X, Y, bins=25, cmap = "cool")

plt.show()
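
To see the numbers behind the colours, the following sketch computes the two-dimensional bin counts with np.histogram2d() instead of plotting them. The fixed seed and the names counts, xedges and yedges are assumptions added for illustration.

```python
import numpy as np

# Assumption: a fixed seed so that the numbers are reproducible
rng = np.random.default_rng(0)
Y = rng.normal(loc=0.0, scale=1.0, size=10000)
X = rng.random(size=10000)

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, xedges, yedges = np.histogram2d(X, Y, bins=25)

print(counts.shape)       # one entry per cell of the heatmap
print(int(counts.sum()))  # every point lands in exactly one cell

# y-bin that contains the most points, summed over all x-bins
densest = counts.sum(axis=0).argmax()
print(yedges[densest], yedges[densest + 1])
```

Each entry of counts corresponds to one coloured cell of the heatmap. Because Y is normally distributed around 0, the densest y-bin is expected to lie near 0, while the counts vary little along the uniformly distributed x-axis.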