< Previous Page | Home Page | Next Page >
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum()
, mean()
, median()
, min()
, and max()
, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we'll explore aggregations in Pandas.
import numpy as np
import pandas as pd
Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn). It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). It can be downloaded with a simple Seaborn command:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
This has some details on the 1,000+ extrasolar planets discovered up to 2014.
Earlier, we explored some of the data aggregations available for NumPy arrays.
As with a one-dimensional NumPy array, for a Pandas Series
the aggregates return a single value:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
ser.sum()
ser.mean()
For a DataFrame
, by default the aggregates return results within each column:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
df.mean()
By specifying the axis
argument, you can instead aggregate within each row:
df.mean(axis='columns')
Pandas Series
and DataFrame
s include all of the common aggregates ; in addition, there is a convenience method describe()
that computes several common aggregates for each column and returns the result.
Let's use this on the Planets data, for now dropping rows with missing values:
planets.dropna().describe()
This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the year
column that although exoplanets were discovered as far back as 1989, half of all known expolanets were not discovered until 2010 or after.
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() |
Total number of items |
first() , last() |
First and last item |
mean() , median() |
Mean and median |
min() , max() |
Minimum and maximum |
std() , var() |
Standard deviation and variance |
mad() |
Mean absolute deviation |
prod() |
Product of all items |
sum() |
Sum of all items |
These are all methods of DataFrame
and Series
objects.
To go deeper into the data, however, simple aggregates are often not enough.
The next level of data summarization is the groupby
operation, which allows you to quickly and efficiently compute aggregates on subsets of data.
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby
operation.
groubby
has the following steps:
DataFrame
depending on the value of the specified key.While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy
can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the GroupBy
is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.
As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.
We'll start by creating the input DataFrame
:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
The most basic split-apply-combine operation can be computed with the groupby()
method of DataFrame
s, passing the name of the desired key column:
df.groupby('key')
Notice that what is returned is not a set of DataFrame
s, but a DataFrameGroupBy
object.
This object is where the magic is: you can think of it as a special view of the DataFrame
, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.
To produce a result, we can apply an aggregate to this DataFrameGroupBy
object, which will perform the appropriate apply/combine steps to produce the desired result:
df.groupby('key').sum()
The sum()
method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid DataFrame
operation, as we will see in the following discussion.
The GroupBy
object is a very flexible abstraction.
In many ways, you can simply treat it as if it's a collection of DataFrame
s, and it does the difficult things under the hood. Let's see some examples using the Planets data.
The GroupBy
object supports column indexing in the same way as the DataFrame
, and returns a modified GroupBy
object.
For example:
planets.groupby('method')
planets.groupby('method')['orbital_period']
Here we've selected a particular Series
group from the original DataFrame
group by reference to its column name.
As with the GroupBy
object, no computation is done until we call some aggregate on the object:
planets.groupby('method')['orbital_period'].median()
This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
The GroupBy
object supports direct iteration over the groups, returning each group as a Series
or DataFrame
:
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
This can be useful for doing certain things manually, though it is often much faster to use the built-in apply
functionality, which we will discuss momentarily.
The preceding discussion focused on aggregation for the combine operation, but there are more options available.
In particular, GroupBy
objects have aggregate()
, filter()
, transform()
, and apply()
methods that efficiently implement a variety of useful operations before combining the grouped data.
For the purpose of the following subsections, we'll use this DataFrame
:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
We're now familiar with GroupBy
aggregations with sum()
, median()
, and the like, but the aggregate()
method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:
df.groupby('key').aggregate(['min', np.median, max])
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
def filter_func(x):
return x['data2'].std() > 4
display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")
The filter function should return a Boolean value specifying whether the group passes the filtering. Here because group A does not have a standard deviation greater than 4, it is dropped from the result.
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
The apply()
method lets you apply an arbitrary function to the group results.
The function should take a DataFrame
, and return either a Pandas object (e.g., DataFrame
, Series
) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an apply()
that normalizes the first column by the sum of the second:
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
display('df', "df.groupby('key').apply(norm_by_data2)")
apply()
within a GroupBy
is quite flexible: the only criterion is that the function takes a DataFrame
and returns a Pandas object or scalar; what you do in the middle is up to you!
< Previous Page | Home Page | Next Page >