pandas Arithmetic and Functions

1. Arithmetic and Broadcasting

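When two Series or DataFrame objects are combined arithmetically, the index of the result is the union of the two objects' indexes. Any label present in only one operand produces a missing value (NaN) in the result, and that NaN propagates into later operations. This is similar to an outer join in a database.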

In [56]: import numpy as np
In [57]: import pandas as pd
In [58]: s1 = pd.Series([4.2, 2.6, 5.4, -1.9], index=list('acde'))
In [60]: s2 = pd.Series([-2.3, 1.2, 5.6, 7.2, 3.4], index=list('acefg'))
In [61]: s1
Out[61]:
a    4.2
c    2.6
d    5.4
e   -1.9
dtype: float64
In [62]: s2
Out[62]:
a   -2.3
c    1.2
e    5.6
f    7.2
g    3.4
dtype: float64
In [63]: s1+s2
Out[63]:
a    1.9
c    3.8
d    NaN
e    3.7
f    NaN
g    NaN
dtype: float64
In [64]: s1-s2
Out[64]:
a    6.5
c    1.4
d    NaN
e   -7.5
f    NaN
g    NaN
dtype: float64
In [65]: s1 * s2
Out[65]:
a    -9.66
c     3.12
d      NaN
e   -10.64
f      NaN
g      NaN
dtype: float64
In [66]: df1 = pd.DataFrame(np.arange(9).reshape(3,3),columns=list('bcd'),index=['one','two','three'])
In [67]: df2 = pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['two','three','five','six'])
In [68]: df1
Out[68]:
       b  c  d
one    0  1  2
two    3  4  5
three  6  7  8
In [69]: df2
Out[69]:
       b   d   e
two    0   1   2
three  3   4   5
five   6   7   8
six    9  10  11
In [70]: df1 + df2
Out[70]:
         b   c     d   e
five   NaN NaN   NaN NaN
one    NaN NaN   NaN NaN
six    NaN NaN   NaN NaN
three  9.0 NaN  12.0 NaN
two    3.0 NaN   6.0 NaN
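
The alignment that happens implicitly in these operations can be made explicit. The following is a minimal sketch, reusing the s1 and s2 values from above: it calls Series.align with join='outer' to expose the two aligned operands, then checks which labels end up as NaN in the sum. The variable names (left, right, union_labels, missing_labels) are illustrative only.

```python
import pandas as pd

s1 = pd.Series([4.2, 2.6, 5.4, -1.9], index=list("acde"))
s2 = pd.Series([-2.3, 1.2, 5.6, 7.2, 3.4], index=list("acefg"))

# align() returns both operands reindexed to the union of the two indexes;
# a label that one side lacks is filled with NaN on that side.
left, right = s1.align(s2, join="outer")

# The + operator performs the same alignment before adding.
total = s1 + s2

union_labels = list(total.index)
missing_labels = list(total[total.isna()].index)
print(union_labels)
print(missing_labels)
```

Labels that appear in only one of the two Series are exactly the ones reported as missing in the result.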

To keep NaN from affecting later steps, it is often useful to supply a fill value in operations like the ones above:

In [71]: df1.add(df2, fill_value=0)
Out[71]:
         b    c     d     e
five   6.0  NaN   7.0   8.0
one    0.0  1.0   2.0   NaN
six    9.0  NaN  10.0  11.0
three  9.0  7.0  12.0   5.0
two    3.0  4.0   6.0   2.0
In [74]: df1.reindex(columns=df2.columns, fill_value=0) # reindex accepts fill_value as well
Out[74]:
       b  d  e
one    0  2  0
two    3  5  0
three  6  8  0

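Note what filling means here: when one operand has a value at a given position and the other does not, the missing one is replaced by the specified fill value before the operation. It does not replace every NaN in the final result; a position missing from both operands is still NaN.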
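
The difference between filling before the operation and filling afterwards can be checked directly. The sketch below rebuilds df1 and df2 from above and compares add(..., fill_value=0) with calling fillna(0) on the plain sum; filled_before and filled_after are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list("bcd"), index=["one", "two", "three"])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=list("bde"), index=["two", "three", "five", "six"])

# fill_value stands in for a value missing on ONE side before adding.
filled_before = df1.add(df2, fill_value=0)

# fillna replaces every NaN AFTER adding, whatever its cause.
filled_after = (df1 + df2).fillna(0)

# ("one", "c") exists only in df1 (value 1): kept by fill_value, lost by fillna.
print(filled_before.loc["one", "c"], filled_after.loc["one", "c"])

# ("five", "c") exists in neither frame: fill_value leaves it as NaN.
print(filled_before.loc["five", "c"], filled_after.loc["five", "c"])
```

The two approaches agree only where both operands have a value; everywhere else they answer different questions.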

The full set of methods similar to add is:

add: addition
sub: subtraction
div: division
floordiv: floor division
mul: multiplication
pow: exponentiation
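
As a quick check of the list above, the sketch below compares each method with its operator counterpart on a small frame. The frame contents and the pairs and all_match names are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(2, 3), columns=list("bde"))

# With default arguments, each method gives the same result as the operator.
pairs = {
    "add": (df.add(2), df + 2),
    "sub": (df.sub(2), df - 2),
    "div": (df.div(2), df / 2),
    "floordiv": (df.floordiv(2), df // 2),
    "mul": (df.mul(2), df * 2),
    "pow": (df.pow(2), df ** 2),
}
all_match = {name: method.equals(op) for name, (method, op) in pairs.items()}
print(all_match)
```

The method form is mainly useful when extra arguments such as fill_value or axis are needed, which the operators cannot take.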

A DataFrame can also be combined with a Series. As with NumPy arrays of different dimensions, the operation is carried out by broadcasting:

In [80]: df = pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['one','two','three','four'])
In [81]: s = df.iloc[0]  # take the first row of df as a Series
In [82]: df
Out[82]:
       b   d   e
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11
In [83]: s
Out[83]:
b    0
d    1
e    2
Name: one, dtype: int32
In [84]: df - s # the subtraction is broadcast down the rows
Out[84]:
       b  d  e
one    0  0  0
two    3  3  3
three  6  6  6
four   9  9  9
In [85]: s2 = pd.Series(range(3), index=list('bef'))
In [86]: df + s2  # column labels that do not match on both sides produce missing values
Out[86]:
         b   d     e   f
one    0.0 NaN   3.0 NaN
two    3.0 NaN   6.0 NaN
three  6.0 NaN   9.0 NaN
four   9.0 NaN  12.0 NaN
In [87]: s3 = df['d'] # take one column of df
In [88]: s3
Out[88]:
one       1
two       4
three     7
four     10
Name: d, dtype: int32
In [89]: df.sub(s3, axis='index')  # match on the row index and broadcast across the columns
Out[89]:
       b  d  e
one   -1  0  1
two   -1  0  1
three -1  0  1
four  -1  0  1

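In the last example above, passing axis='index' (or axis=0) matches the Series against the row index, so the broadcast runs in the other direction, across the columns.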
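
A common use of this is centering each row on its own mean. The sketch below is one way to do it, under the assumption that the per-row values come from df.mean(axis=1); it also shows the equivalent NumPy broadcast and what happens when the axis argument is left out. The names row_means, centered, centered_np and mismatched are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=list("bde"), index=["one", "two", "three", "four"])

row_means = df.mean(axis=1)  # one value per row, indexed by the row labels

# axis='index' matches the Series against row labels and repeats it across columns.
centered = df.sub(row_means, axis="index")

# Equivalent NumPy broadcasting: reshape the means into a column vector.
centered_np = df.to_numpy() - row_means.to_numpy()[:, None]

# Without axis, the Series labels are matched against the COLUMN labels.
# None of them match here, so every cell of the result is NaN.
mismatched = df - row_means

print(centered)
```

Forgetting the axis argument does not raise an error; it silently yields a frame of NaN values, so it is worth checking the result's shape.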

2. Functions and Mapping

NumPy universal functions (ufuncs) also work on pandas objects:

In [91]: df = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),index = ['one','two','three','four'])
In [92]: df
Out[92]:
              b         d         e
one   -0.522310  0.636599  0.992393
two    0.572624 -0.451550 -1.935332
three  0.021926  0.056706 -0.267661
four  -2.718122 -0.740140 -1.565448
In [93]: np.abs(df)
Out[93]:
              b         d         e
one    0.522310  0.636599  0.992393
two    0.572624  0.451550  1.935332
three  0.021926  0.056706  0.267661
four   2.718122  0.740140  1.565448
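
Because the values above come from a random generator, they cannot be reproduced exactly. The sketch below uses a small frame with fixed, made-up values to show the same behavior: a ufunc is applied element by element, and the result is still a DataFrame with the original labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-1.5, 4.0], [9.0, -0.25]],
                  columns=["b", "d"], index=["one", "two"])

abs_df = np.abs(df)        # element-wise absolute value, still a DataFrame
root_df = np.sqrt(abs_df)  # ufuncs can be chained; row and column labels are kept

print(type(abs_df).__name__)
print(root_df)
```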

You can also write your own function and pass it to the apply method, which by default calls it once per column:

In [94]: f = lambda x: x.max() - x.min()
In [95]: df.apply(f)
Out[95]:
b    3.290745
d    1.376740
e    2.927725
dtype: float64

To apply f to each row instead, pass axis='columns'. The applied function can also return a Series, in which case the result is a DataFrame:

In [96]: df.apply(f, axis='columns')
Out[96]:
one      1.514703
two      2.507956
three    0.324367
four     1.977981
dtype: float64
In [97]: def f2(x):
    ...:     return pd.Series([x.min(),x.max()], index=['min','max'])

In [98]: df.apply(f2)
Out[98]:
            b         d         e
min -2.718122 -0.740140 -1.935332
max  0.572624  0.636599  0.992393
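
The shape of the result depends on the axis. The sketch below, using fixed made-up values and a hypothetical helper named min_max, applies a Series-returning function in both directions and then computes the same max-minus-min spread with built-in reductions.

```python
import pandas as pd

df = pd.DataFrame({"b": [1.0, 4.0], "d": [3.0, 2.0], "e": [2.0, 6.0]},
                  index=["one", "two"])

def min_max(x):
    # x is one column (default axis) or one row (axis='columns'), as a Series.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# Per column: the returned index ('min', 'max') becomes the row labels.
per_column = df.apply(min_max)

# Per row: the returned index ('min', 'max') becomes the column labels.
per_row = df.apply(min_max, axis="columns")

# For simple reductions, built-in methods give the same numbers without apply.
spread = df.max() - df.min()

print(per_column)
print(per_row)
```

For reductions such as min and max, the built-in methods are usually the simpler choice; apply is most useful when no vectorized equivalent exists.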

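There are also finer-grained, element-wise methods: applymap for a DataFrame and map for a Series. They operate on each element individually rather than on a whole row or column. (In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.) Consider the following example: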

In [99]: f3 = lambda x: '%.2f' % x
In [100]: df.applymap(f3)
Out[100]:
           b      d      e
one    -0.52   0.64   0.99
two     0.57  -0.45  -1.94
three   0.02   0.06  -0.27
four   -2.72  -0.74  -1.57

In [101]: df['d'].map(f3) # select column d, which is a Series
Out[101]:
one       0.64
two      -0.45
three     0.06
four     -0.74
Name: d, dtype: object
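
The sketch below shows one way to write the element-wise formatting so that it runs on both older and newer pandas, assuming DataFrame.map is available from pandas 2.1 onward and applymap is the fallback before that. The values and the two_decimals helper are made up for illustration. It also shows that Series.map accepts a dict as well as a function.

```python
import pandas as pd

df = pd.DataFrame({"b": [-0.5223, 0.5726], "d": [0.6366, -0.4516]},
                  index=["one", "two"])

def two_decimals(x):
    # Format a single number as a string with two decimal places.
    return "%.2f" % x

# Assumption: DataFrame.map exists on pandas 2.1+; older versions use applymap.
if hasattr(pd.DataFrame, "map"):
    formatted = df.map(two_decimals)
else:
    formatted = df.applymap(two_decimals)

# Series.map also accepts a dict; values without a matching key become NaN.
codes = pd.Series(["a", "b", "z"]).map({"a": 1, "b": 2})

print(formatted)
print(codes)
```

Note that the formatted frame holds strings, not numbers, so it is suited to display rather than further arithmetic.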