groupby

1) Grouping by a single column or a set of columns

• Grouping a table directly, without applying any aggregation, returns a GroupBy object rather than a displayable table:

In [29]: df.groupby('key1')

Out[29]: <pandas.core.groupby.DataFrameGroupBy object at 0x000002BB1364A4A8>

To see the contents of the groups:

(1) Iterate over the groups:

for name, group in df.groupby('key1'):
    print(name)
    print(group)

>>>

data1 data2 key1 key2

0 -0.204708 1.393406 a one

1 0.478943 0.092908 a two

4 1.965781 1.246435 a one

-------

data1 data2 key1 key2

2 -0.519439 0.281746 b one

3 -0.555730 0.769023 b two

-------

(2) Convert the groups to a dictionary:

dict(list(df.groupby('key1')))

>>>

{'a':

data1 data2 key1 key2

0 -0.204708 1.393406 a one

1 0.478943 0.092908 a two

4 1.965781 1.246435 a one ,

'b':

data1 data2 key1 key2

2 -0.519439 0.281746 b one

3 -0.555730 0.769023 b two }
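The two ways of displaying groups can be tried on a small frame. This is a minimal sketch that rebuilds `df` from the values printed above; the variable name `pieces` is only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the values printed above.
df = pd.DataFrame({
    'data1': [-0.204708, 0.478943, -0.519439, -0.555730, 1.965781],
    'data2': [1.393406, 0.092908, 0.281746, 0.769023, 1.246435],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

# Iterating a GroupBy yields (group name, sub-DataFrame) pairs.
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    print('-------')

# The same pairs can be collected into a dict keyed by group name.
pieces = dict(list(df.groupby('key1')))
print(pieces['b'])
```

Each value of the dict is an ordinary DataFrame, so `pieces['b']` can be used like any other table.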

• Grouping the columns by their data type

grouped = df.groupby(df.dtypes, axis=1)  # axis=1 groups the columns (deprecated since pandas 2.1)

dict(list(grouped))

{dtype('float64'): data1 data2

0 -0.204708 1.393406

1 0.478943 0.092908

2 -0.519439 0.281746

3 -0.555730 0.769023

4 1.965781 1.246435 ,

dtype('O'): key1 key2

0 a one

1 a two

2 b one

3 b two

4 a one }
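Because the `axis=1` argument of `groupby` is deprecated in recent pandas releases, the call above may warn or fail there. One possible workaround, sketched below, groups the column names by dtype and then selects each set of columns. The names `dtype_names` and `by_dtype` are assumptions of this example, and the dict keys are dtype names as strings rather than dtype objects.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [-0.204708, 0.478943, -0.519439, -0.555730, 1.965781],
    'data2': [1.393406, 0.092908, 0.281746, 0.769023, 1.246435],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

# df.dtypes is a Series mapping column name -> dtype. Grouping it by the
# dtype's string name collects the column names that share each dtype.
dtype_names = df.dtypes.astype(str)
by_dtype = {
    name: df[list(cols.index)]
    for name, cols in df.dtypes.groupby(dtype_names)
}

for name, part in by_dtype.items():
    print(name)
    print(part)
```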

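2) Grouping with a dictionary or a Series

Suppose a table people has the column names: columns=['a', 'b', 'c', 'd', 'e']

Define a dictionary that maps column names to group names (the unused key 'f' is simply ignored): mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}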

by_column = people.groupby(mapping, axis=1)

by_column.sum()

>>>

blue red

Joe 0.503905 1.063885

Steve 1.297183 -1.553778

Wes -1.021228 -1.116829

Jim 0.524712 1.770545

Travis -4.230992 -2.405455

map_series = pd.Series(mapping)  # a Series can be used in the same way as the dict

people.groupby(map_series, axis=1).count()
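The mapping-based grouping can also be written without `axis=1`, which is deprecated in recent pandas releases. The sketch below transposes the frame, groups the former column labels through the mapping, and transposes back. The values of `people` are hypothetical (the original numbers are not shown in full), chosen so the sums are easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical data: integers 0..24 laid out row by row.
people = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the index, group them through the
# mapping, aggregate, and transpose back. The unused key 'f' is ignored.
by_column = people.T.groupby(mapping).sum().T
print(by_column)

# A Series is aligned on its index and works the same way as the dict.
map_series = pd.Series(mapping)
counts = people.T.groupby(map_series).count().T
print(counts)
```

For Joe, the row is 0, 1, 2, 3, 4, so red is a + b + e = 5 and blue is c + d = 5.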

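3) Grouping rows

In essence this still aggregates each column; the rows are simply grouped by labels aligned with the index, in the same way as when grouping by a column.

key = ['ss', 'kk', 'kk', 'ss', 'ss']  # one group label per row, in index order

data.groupby(key).mean()  # per-column mean within each key group; groupby(key) defaults to axis=0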
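As a small illustration of grouping rows by an external list of labels, the sketch below uses a hypothetical two-column frame (the names `x`, `y`, and `r1` to `r5` are made up) and shows that the result matches grouping by a column holding the same labels.

```python
import pandas as pd

# Hypothetical values, chosen only to make the arithmetic easy to follow.
data = pd.DataFrame(
    {'x': [1, 2, 3, 4, 5], 'y': [10, 20, 30, 40, 50]},
    index=['r1', 'r2', 'r3', 'r4', 'r5'],
)
key = ['ss', 'kk', 'kk', 'ss', 'ss']  # one label per row, in row order

# Rows r1, r4, r5 form group 'ss'; rows r2, r3 form group 'kk'.
means = data.groupby(key).mean()
print(means)

# Storing the labels as a column and grouping by that column is equivalent.
same = data.assign(key=key).groupby('key').mean()
print(same)
```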

4) Data aggregation

(1) quantile

df=

data1 data2 key1 key2

0 -0.204708 1.393406 a one

1 0.478943 0.092908 a two

2 -0.519439 0.281746 b one

3 -0.555730 0.769023 b two

4 1.965781 1.246435 a one

grouped = df.groupby('key1')

grouped['data1'].quantile(0.9)
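To make the quantile call concrete, here is a runnable sketch on the frame shown above. `quantile` interpolates linearly between the sorted values by default, so the result need not be one of the observed values. The name `q90` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [-0.204708, 0.478943, -0.519439, -0.555730, 1.965781],
    'data2': [1.393406, 0.092908, 0.281746, 0.769023, 1.246435],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})

grouped = df.groupby('key1')

# 90th percentile of data1 within each key1 group.
q90 = grouped['data1'].quantile(0.9)
print(q90)
```

For group a the sorted values are -0.204708, 0.478943, 1.965781, and the 0.9 quantile sits 80% of the way from the second value to the third, about 1.668413.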

(2)agg()

grouped = df.groupby('key1')

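def peak_to_peak(arr):
    # range of the values within one group
    return arr.max() - arr.min()

grouped[['data1', 'data2']].agg(peak_to_peak)  # select the numeric columns; subtraction is undefined for the string column key2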

>>>

data1 data2

key1

a 2.170488 1.300498

b 0.036292 0.487276

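Tip:

The difference between agg() and apply()

A function passed to agg() must reduce each group to a single value per column, so it can only be used for aggregation. apply() is more general: its function may return a scalar, a Series, or a DataFrame.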
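The difference can be seen in a short sketch on the same `df`. The lambda functions and the names `spread` and `top` are illustrative: the first call reduces each group to one row, while the second returns several rows per group, which agg() is not meant for.

```python
import pandas as pd

df = pd.DataFrame({
    'data1': [-0.204708, 0.478943, -0.519439, -0.555730, 1.965781],
    'data2': [1.393406, 0.092908, 0.281746, 0.769023, 1.246435],
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
})
numeric = df.groupby('key1')[['data1', 'data2']]

# agg: each column of each group collapses to one value (one row per group).
spread = numeric.agg(lambda s: s.max() - s.min())
print(spread)

# apply: the function may return a whole sub-frame, here the two rows with
# the largest data1 in each group.
top = numeric.apply(lambda g: g.nlargest(2, 'data1'))
print(top)
```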

agg() can be called in several forms:

grouped_pct.agg(['mean', 'std', peak_to_peak])

grouped_pct.agg([('foo', 'mean'), ('bar', 'std')])  # (name, function) tuples give each result an alias

functions = ['count', 'mean', 'max']

result = grouped[['tip_pct', 'total_bill']].agg(functions)  # list form; several columns must be selected with a list

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', 'var')]

grouped[['tip_pct', 'total_bill']].agg(ftuples)  # tuple form

grouped.agg({'tip': 'max', 'size': 'sum'})  # dict form: one function per column

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})  # mixed form
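The forms above can be tried on a tiny stand-in for the tips data. The values below are hypothetical, and the named-aggregation call (keyword = (column, function)) is shown as an alternative way to alias results, not as the form used above.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set; the values are made up.
tips = pd.DataFrame({
    'day': ['Sun', 'Sun', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 4.0, 3.0, 8.0],
    'size': [2, 3, 2, 4],
})
tips['tip_pct'] = tips['tip'] / tips['total_bill']
grouped = tips.groupby('day')

# List form: every function is applied to every selected column.
multi = grouped[['tip_pct', 'total_bill']].agg(['count', 'mean', 'max'])
print(multi)

# Named aggregation: keyword = (column, function) sets the output name.
named = grouped.agg(foo=('tip_pct', 'mean'), bar=('tip_pct', 'std'))
print(named)

# Dict form: a different function per column.
per_column = grouped.agg({'tip': 'max', 'size': 'sum'})
print(per_column)
```

The list form produces hierarchical columns such as ('total_bill', 'mean'), while the named and dict forms keep a flat set of columns.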

(3)transform()

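After groupby we often use aggregate, filter, or apply to summarize the data. transform, by contrast, does not collapse the groups: the result has the same shape as the input, and each row holds the function's value for that row's group.

Example: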

data=

a b c d e

li 1 2 3 4 5

chen 2 1 1 2 2

wang 1 2 3 4 5

zhao 2 1 1 2 2

qian 1 2 3 4 5

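• An ordinary aggregation function collapses each group into one row

key = ['ss', 'kk', 'kk', 'ss', 'ss']  # one group label per row, in index order

print(data.groupby(key).mean())  # per-column mean within each key group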

>>>

a b c d e

kk 1.500000 1.500000 2.000000 3.000000 3.5

ss 1.333333 1.666667 2.333333 3.333333 4.0

• transform() does not collapse the groups

data.groupby(key).transform('mean')

# each element of data is replaced by the mean of its column within its group

>>>

a b c d e

li 1.333333 1.666667 2.333333 3.333333 4.0

chen 1.500000 1.500000 2.000000 3.000000 3.5

wang 1.500000 1.500000 2.000000 3.000000 3.5

zhao 1.333333 1.666667 2.333333 3.333333 4.0

qian 1.333333 1.666667 2.333333 3.333333 4.0
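The transform example can be reproduced with the table shown above. The sketch also adds one common use that is not part of the original example, subtracting the group mean from each row; the names `group_means` and `demeaned` are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    [[1, 2, 3, 4, 5],
     [2, 1, 1, 2, 2],
     [1, 2, 3, 4, 5],
     [2, 1, 1, 2, 2],
     [1, 2, 3, 4, 5]],
    index=['li', 'chen', 'wang', 'zhao', 'qian'],
    columns=['a', 'b', 'c', 'd', 'e'],
)
key = ['ss', 'kk', 'kk', 'ss', 'ss']

# Same shape and index as data: every row receives its group's column means.
group_means = data.groupby(key).transform('mean')
print(group_means)

# Because the shapes match, the result can be combined with the original.
demeaned = data - group_means
print(demeaned)
```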

5) Pivot tables

Suppose I want to compute group means (the default aggregation of pivot_table) by sex and smoker, with sex and smoker arranged on the rows:

# Method 1: groupby

tips.groupby(['sex', 'smoker']).mean(numeric_only=True)  # skip the non-numeric columns

# Method 2: pivot_table

tips.pivot_table(index=['sex', 'smoker'])

The two methods give the same result.
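The equivalence can be checked on a small stand-in for the tips data. The values below are hypothetical, and the value columns are named explicitly so that both calls aggregate the same numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set; the values are made up.
tips = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25, 0.30, 0.05],
    'size': [2, 3, 2, 4, 2, 3],
})
cols = ['tip_pct', 'size']

# Method 1: group, select the value columns, take the mean.
by_groupby = tips.groupby(['sex', 'smoker'])[cols].mean()
print(by_groupby)

# Method 2: pivot_table uses the mean as its default aggregation.
by_pivot = tips.pivot_table(values=cols, index=['sex', 'smoker'])
print(by_pivot)
```

The column order of the two results may differ, but the values for each (sex, smoker) pair agree.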

Now suppose we only want to aggregate tip_pct and size, and also want to group by day. I put smoker on the columns and sex and day on the rows:

tips.pivot_table(values=['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')

To use a different aggregation function, pass it to the aggfunc parameter. For example, count or len produces a cross-tabulation of group sizes:

tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day', aggfunc=len, margins=True)
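A short sketch of both calls on hypothetical data follows. With `margins=True`, pivot_table adds an `All` row and column holding the totals. A single row key is used for the count table here only to keep the output small.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set; the values are made up.
tips = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sun', 'Sat'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25, 0.30, 0.05],
    'size': [2, 3, 2, 4, 2, 3],
})

# Means of two value columns, with smoker spread across the columns.
# Combinations that do not occur in the data show up as NaN.
means = tips.pivot_table(values=['tip_pct', 'size'],
                         index=['sex', 'day'], columns='smoker')
print(means)

# Group sizes: len counts the rows in each cell; margins adds the totals.
counts = tips.pivot_table('tip_pct', index='sex', columns='smoker',
                          aggfunc=len, margins=True)
print(counts)
```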

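6) Cross-tabulation

crosstab() is a special case of pivot_table() dedicated to computing group frequencies.

The following two methods give the same result:

# Method 1: pivot_table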

data.pivot_table(index=['Gender'], columns='Handedness', aggfunc=len, margins=True)

# Method 2: crosstab

pd.crosstab(data.Gender, data.Handedness, margins=True)
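The two methods can be compared on a small example. The frame below is hypothetical: the column `Sample` is an assumed identifier column that gives pivot_table something to count.

```python
import pandas as pd

# Hypothetical survey data; the values are made up.
data = pd.DataFrame({
    'Sample': [1, 2, 3, 4, 5, 6],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed',
                   'Right-handed', 'Left-handed', 'Left-handed'],
})

# Method 1: pivot_table, counting the rows in each cell with len.
pt = data.pivot_table(values='Sample', index='Gender',
                      columns='Handedness', aggfunc=len, margins=True)
print(pt)

# Method 2: crosstab counts the co-occurrences directly.
ct = pd.crosstab(data['Gender'], data['Handedness'], margins=True)
print(ct)
```

crosstab needs no value column, which is why it is the more convenient choice for pure frequency tables.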