数据可视化基础专题（三十二）：Pandas基础（十二）分组（一）Splitting an object into groups

1 简介 Group by: split-apply-combine

By “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
- Compute group sums or means.
- Compute group sizes / counts.
Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
- Standardize data (zscore) within a group.
- Filling NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into either of the above two categories.

Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2

We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality then provide some non-trivial examples / use cases.

See the cookbook for some advanced strategies.

2 Splitting an object into groups

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

In [1]: df = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )
   ...: 

In [2]: df
Out[2]: 
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

# default is axis=0
In [3]: grouped = df.groupby("class")

In [4]: grouped = df.groupby("order", axis="columns")

In [5]: grouped = df.groupby(["class", "order"])

The mapping can be specified many different ways:

A Python function, to be called on each of the axis labels.
A list or NumPy array of the same length as the selected axis.
A dict or Series, providing a label -> group name mapping.
For DataFrame objects, a string indicating either a column name or an index level name to be used to group.
df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
A list of any of the above things.

Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:

In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...: 

In [7]: df
Out[7]: 
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns, or both:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

New in version 0.24.

If we also have a MultiIndex on columns A and B, we can group by all but the specified columns

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

These will split the DataFrame on its index (rows). We could also split by the columns:

In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....: 

In [14]: grouped = df.groupby(get_letter_type, axis=1)

pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:

In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [17]: grouped = s.groupby(level=0)

In [18]: grouped.first()
Out[18]: 
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()
Out[19]: 
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]: 
1    11
2    22
3    33
dtype: int64

Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.

3 GroupBy sorting

By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups:

In [21]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

In [22]: df2.groupby(["X"]).sum()
Out[22]: 
   Y
X   
A  7
B  3

In [23]: df2.groupby(["X"], sort=False).sum()
Out[23]: 
   Y
X   
B  3
A  7

Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by groupby() below are in the order they appeared in the original DataFrame:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

4 GroupBy dropna

By default NA values are excluded from group keys during the groupby operation. However, in case you want to include NA values in group keys, you could pass dropna=False to achieve it.

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

The default setting of dropna argument is True which means NA are not included in group keys.

5 GroupBy object attributes

The groups attribute is a dict whose keys are the computed unique groups and corresponding values being the axis labels belonging to each group. In the above example we have:

In [32]: df.groupby("A").groups
Out[32]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

In [33]: df.groupby(get_letter_type, axis=1).groups
Out[33]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}

Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

GroupBy will tab complete column names (and other attributes):

In [37]: df
Out[37]: 
               height      weight  gender
2000-01-01  42.849980  157.500553    male
2000-01-02  49.607315  177.340407    male
2000-01-03  56.293531  171.524640    male
2000-01-04  48.421077  144.251986  female
2000-01-05  46.556882  152.526206    male
2000-01-06  68.448851  168.272968  female
2000-01-07  70.757698  136.431469    male
2000-01-08  58.909500  176.499753  female
2000-01-09  76.435631  174.094104  female
2000-01-10  45.306120  177.540920    male

In [38]: gb = df.groupby("gender")

In [39]: gb.<TAB>  # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

6 GroupBy with MultiIndex

With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.

Let’s create a Series with a two-level MultiIndex.

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

We can then group by one of the levels in s.

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

If the MultiIndex has names specified, these can be passed instead of the level number:

In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

The aggregation functions such as sum will take the level parameter directly. Additionally, the resulting index will be named according to the chosen level:

In [47]: s.sum(level="second")
Out[47]: 
second
one    0.980950
two    1.991575
dtype: float64

Grouping with multiple levels is supported.

In [48]: s
Out[48]: 
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [49]: s.groupby(level=["first", "second"]).sum()
Out[49]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

Index level names may be supplied as keys.

In [50]: s.groupby(["first", "second"]).sum()
Out[50]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

More on the sum function and aggregation later.

7 Grouping DataFrame with Index levels and columns

A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.

In [51]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [52]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [53]: df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

In [54]: df
Out[54]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

The following example groups df by the second index level and the A column.

In [55]: df.groupby([pd.Grouper(level=1), "A"]).sum()
Out[55]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Index levels may also be specified by name.

In [56]: df.groupby([pd.Grouper(level="second"), "A"]).sum()
Out[56]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Index level names may be specified as keys directly to groupby.

In [57]: df.groupby(["second", "A"]).sum()
Out[57]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

8 DataFrame column selection in GroupBy

Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:

In [58]: grouped = df.groupby(["A"])

In [59]: grouped_C = grouped["C"]

In [60]: grouped_D = grouped["D"]

This is mainly syntactic sugar for the alternative and much more verbose:

In [61]: df["C"].groupby(df["A"])
Out[61]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc1ed01ca90>

Additionally this method avoids recomputing the internal grouping information derived from the passed key.