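Basic Functionality

Reindexing

An important method on pandas objects is reindex, which creates a new object conformed to a new index. The IPython sessions below assume import numpy as np and from pandas import Series, DataFrame.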

In [136]: obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])

In [137]: obj2=obj.reindex(['a','b','c','d','e'])

In [138]: obj2
Out[138]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

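If an index value was not already present, a missing value (NaN) is introduced, as for 'e' above. The fill_value option substitutes a different value: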

In [140]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[140]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data such as time series, you may want to interpolate or fill values when reindexing. The method option does this:

In [141]: obj3=Series(['blue','purple','yellow'],index=[0,2,4])

In [142]: obj3.reindex(range(6),method='ffill')
Out[142]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

reindex method (interpolation) options:

ffill or pad: fill (carry) values forward

bfill or backfill: fill (carry) values backward
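To make the difference between the two fill directions concrete, here is a minimal sketch that reuses the color Series from the session above. The variable names are illustrative only.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last known value before it.
forward = colors.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next known value after it.
# Label 5 has no later value, so it stays missing.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Both calls require the original index to be sorted (monotonic), which is why this option suits ordered data.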

For a DataFrame, reindex can alter the (row) index, the columns, or both. When passed just a sequence, it reindexes the rows:

In [143]: frame=DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=['one','two','three'])
     ...:

In [144]: frame
Out[144]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8

In [145]: frame2=frame.reindex(['a','b','c','d'])

In [146]: frame2
Out[146]:
   one  two  three
a  0.0  1.0    2.0
b  3.0  4.0    5.0
c  6.0  7.0    8.0
d  NaN  NaN    NaN

In [147]: states=['red','yellow','green']

In [148]: frame.reindex(columns=states)
Out[148]:
   red  yellow  green
a  NaN     NaN    NaN
b  NaN     NaN    NaN
c  NaN     NaN    NaN

In [149]: states=['red','one','two']

In [150]: frame.reindex(columns=states)
Out[150]:
   red  one  two
a  NaN    0    1
b  NaN    3    4
c  NaN    6    7

Rows and columns can be reindexed in a single call, though interpolation is applied only row-wise (along axis 0):

In [158]: frame.reindex(index=['a','b','c','d'],method='ffill',columns=['one','two','three'])
Out[158]:
   one  two  three
a    0    1      2
b    3    4      5
c    6    7      8
d    6    7      8

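Label-based indexing with ix makes many reindexing tasks more concise. Note that ix was deprecated in pandas 0.20 and removed in 1.0; on newer versions, use loc for label-based indexing.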
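As a rough sketch of the same idea on a recent pandas release (an assumption, since the sessions in these notes were run on an older version that still had ix), loc handles selection of labels that exist, while reindex remains the tool when a label may be absent:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['one', 'two', 'three'])

# loc selects (and reorders) existing labels on both axes in one step.
subset = frame.loc[['a', 'c'], ['three', 'one']]

# reindex is still needed when a requested label may be missing,
# because recent pandas raises KeyError from loc for unknown labels.
expanded = frame.reindex(index=['a', 'b', 'c', 'd'], columns=['one', 'red'])

print(subset)
print(expanded)
```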

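Arguments to reindex:

index: the new sequence to use as the index. It can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying.

method: interpolation (fill) method

fill_value: substitute value to use when reindexing introduces missing data

limit: maximum number of consecutive elements to fill when forward- or backward-filling

level: match a simple Index on the given level of a MultiIndex; otherwise select a subset

copy: defaults to True, which always copies the underlying data; if False, the data is not copied when the new index is equivalent to the old one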
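The limit and fill_value arguments are easier to see on a small example. The sketch below uses a made-up two-element Series, not data from these notes:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 4])

# limit=1 forward-fills at most one consecutive new label per gap,
# so labels 2 and 3 remain NaN.
limited = s.reindex(range(6), method='ffill', limit=1)

# fill_value is used only for labels that were absent from the old index.
filled = s.reindex(range(6), fill_value=0.0)

print(limited)
print(filled)
```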

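Dropping Entries from an Axis

Dropping one or more entries from an axis is simple: pass a label, or an array or list of labels, to drop. It returns a new object and leaves the original unchanged.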

In [162]: obj=Series(np.arange(5.),index=['a','b','c','d','e'])

In [163]: new_obj=obj.drop('c')

In [164]: new_obj
Out[164]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [165]: obj
Out[165]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

With a DataFrame, index values can be dropped from either axis:

In [166]: data=DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])

In [167]: data.drop(['Colorado','Ohio'])
Out[167]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [170]: data.drop('two',axis=1)
Out[170]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [171]: data.drop(['two','four'],axis=1)
Out[171]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
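Recent pandas versions also accept index= and columns= keywords on drop, which some readers find clearer than axis=0/1. A minimal sketch, assuming such a version is installed:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Name each axis explicitly instead of passing axis=0 or axis=1.
trimmed = data.drop(index=['Colorado', 'Ohio'], columns=['two', 'four'])

# drop returns a new object; data itself still has all 4 rows and columns.
print(trimmed)
print(data.shape)
```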

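Indexing, Selection, and Filtering

Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the Series's index values rather than only integers: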

In [173]: obj=Series(np.arange(4.),index=['a','b','c','d'])

In [174]: obj['a']
Out[174]: 0.0

In [175]: obj[1]
Out[175]: 1.0

In [176]: obj[['b','a','d']]
Out[176]:
b    1.0
a    0.0
d    3.0
dtype: float64

In [177]: obj[obj<2]
Out[177]:
a    0.0
b    1.0
dtype: float64

Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive:

In [180]: obj['b':'c']
Out[180]:
b    1.0
c    2.0
dtype: float64

In [181]: obj[2:3]
Out[181]:
c    2.0
dtype: float64
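The same contrast can be written with the explicit loc (label) and iloc (position) accessors, which avoids any ambiguity about which kind of slice is meant. A small sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: both endpoints are included
by_position = obj.iloc[1:2]    # positional slice: the stop is excluded

print(by_label.tolist())
print(by_position.tolist())
```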

Assigning through a slice works the same way:

In [182]: obj['b':'c']=5

In [183]: obj
Out[183]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame retrieves one or more columns:

In [184]: data
Out[184]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [185]: data['one']
Out[185]:
Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

This kind of indexing has a few special cases. First, a slice or a boolean array selects rows:

In [187]: data[:2]
Out[187]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [188]: data[data['three']>5]
Out[188]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Another special case is indexing with a boolean DataFrame:

In [189]: data<5
Out[189]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [190]: data[data<5]=0

In [191]: data
Out[191]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
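The assignment above modifies data in place. If the original should be kept, one possible alternative (not used in these notes) is DataFrame.where, which keeps values where the condition holds and substitutes elsewhere:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Keep values that are >= 5 and replace everything else with 0.
# This is the complement of the mask used in data[data < 5] = 0.
cleaned = data.where(data >= 5, 0)

print(cleaned)
print(data.loc['Ohio', 'two'])  # the original frame is untouched
```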

Use the indexing field ix to select a subset of rows and columns:

In [197]: data.ix['Colorado',['two','three']]
Out[197]:
two      5
three    6
Name: Colorado, dtype: int32

In [198]: data.ix[data.three>5,:3]
Out[198]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

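Indexing options with DataFrame:

Type: Description

obj[val]: selects a single column or a set of columns from the DataFrame. It has special-case conveniences: a boolean array (filters rows), a slice (slices rows), or a boolean DataFrame (sets values based on a condition)

obj.ix[val]: selects a single row or a subset of rows from the DataFrame

obj.ix[:,val]: selects a single column or a subset of columns

obj.ix[val1,val2]: selects both rows and columns

reindex method: conforms one or more axes to new indexes

xs method: selects a single row or column by label and returns a Series

icol, irow methods: select a single column or row by integer position and return a Series

get_value, set_value methods: select or set a single value by row and column label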
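Several entries in this list (ix, irow, icol, get_value, set_value) have been removed from recent pandas releases. The sketch below shows what I believe are the usual replacements (loc, iloc, at), assuming a current pandas; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row = data.loc['Colorado']                         # one row by label
cols = data.loc[:, ['two', 'three']]               # column subset by label
block = data.loc[data['three'] > 5, 'one':'three'] # boolean rows, label slice

second_row = data.iloc[1]       # row by integer position (formerly irow)
third_col = data.iloc[:, 2]     # column by integer position (formerly icol)

value = data.at['Utah', 'two']  # single value by labels (formerly get_value)
data.at['Utah', 'two'] = 90     # set a single value (formerly set_value)
```

loc is purely label-based and iloc purely position-based, which removes the label-or-position guessing that ix performed.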

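Arithmetic and Data Alignment

One of the most important pandas features is arithmetic between objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the index pairs.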

In [7]: s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])

In [8]: s2=Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])

In [9]: s1+s2
Out[9]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

Automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through arithmetic.
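One way to picture alignment is as a reindex of both operands onto the union of their indexes, followed by ordinary element-wise addition. The following sketch checks that mental model; it is an illustration, not a description of how pandas implements it internally:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# Reindex both sides onto the union of labels, then add.
union = s1.index.union(s2.index)
manual = s1.reindex(union) + s2.reindex(union)

print(total.index.tolist())
print(total.isna().tolist())
print(manual.equals(total))
```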

For a DataFrame, alignment happens on both the rows and the columns:

In [15]: df1=DataFrame(np.arange(9).reshape(3,3),columns=list('bcd'),index=['one','two','three'])

In [16]: df2=DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['one','two','three','four'])

In [17]: df1+df2
Out[17]:
          b   c     d   e
four    NaN NaN   NaN NaN
one     0.0 NaN   3.0 NaN
three  12.0 NaN  15.0 NaN
two     6.0 NaN   9.0 NaN

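Arithmetic Methods with Fill Values

In arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not the other: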

In [18]: df1=DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))

In [19]: df2=DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))

In [20]: df1+df2
Out[20]:
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

Using the add method on df1, pass df2 and a fill_value argument:

In [21]: df1.add(df2,fill_value=0)
Out[21]:
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

Similarly, when reindexing a Series or DataFrame, you can also specify a fill value:

In [22]: df1.reindex(columns=df2.columns,fill_value=0)
Out[22]:
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0

Arithmetic methods: add (addition), sub (subtraction), div (division), mul (multiplication)
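Each of these methods behaves like its operator when called without extra arguments, and accepts fill_value in the same way as add. A short sketch with sub, using float versions of the two frames from the session above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

plain = df1 - df2        # operator form: NaN where labels do not overlap
method = df1.sub(df2)    # method form without fill_value: same result

# With fill_value=0, a value missing on one side is treated as 0.
filled = df1.sub(df2, fill_value=0)

print(plain.equals(method))
print(filled)
```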

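Operations Between DataFrame and Series

As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined. First, consider the difference between a two-dimensional array and one of its rows: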

In [23]: arr=np.arange(12).reshape(3,4)

In [24]: arr
Out[24]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [25]: arr[0]
Out[25]: array([0, 1, 2, 3])

In [26]: arr-arr[0]
Out[26]:
array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

This is called broadcasting. Operations between a DataFrame and a Series work similarly.

In [27]: frame=DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['one','two','three','four'])

In [28]: series=frame.ix[0]

In [29]: frame
Out[29]:
       b   d   e
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11

In [30]: series
Out[30]:
b    0
d    1
e    2
Name: one, dtype: int32

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and broadcasts down the rows:

In [31]: frame-series
Out[31]:
       b  d  e
one    0  0  0
two    3  3  3
three  6  6  6
four   9  9  9

If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:

In [32]: series2=Series(range(3),index=['b','e','f'])

In [33]: frame+series2
Out[33]:
         b   d     e   f
one    0.0 NaN   3.0 NaN
two    3.0 NaN   6.0 NaN
three  6.0 NaN   9.0 NaN
four   9.0 NaN  12.0 NaN

If you instead want to match on the rows and broadcast across the columns, you must use one of the arithmetic methods:

In [34]: series3=frame['d']

In [35]: frame
Out[35]:
       b   d   e
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11

In [36]: series3
Out[36]:
one       1
two       4
three     7
four     10
Name: d, dtype: int32

In [37]: frame.sub(series3,axis=0)
Out[37]:
       b  d  e
one   -1  0  1
two   -1  0  1
three -1  0  1
four  -1  0  1
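The two broadcasting directions can be compared side by side, together with the equivalent NumPy expression for the row-matched case. This is a sketch using iloc in place of ix, on the assumption of a recent pandas:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     columns=list('bde'),
                     index=['one', 'two', 'three', 'four'])

# Default: the Series index is matched to the columns, broadcast down rows.
by_columns = frame - frame.iloc[0]

# axis=0: the Series index is matched to the row labels instead.
by_rows = frame.sub(frame['d'], axis=0)

# NumPy equivalent of by_rows: subtract a (4, 1) column from a (4, 3) array.
numpy_rows = frame.values - frame.values[:, [1]]

print(by_columns)
print(by_rows)
```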

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [40]: frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['one','two','three','four'])

In [41]: frame
Out[41]:
              b         d         e
one    0.341012  0.864333  0.228682
two   -0.838441  0.483385  1.747057
three  0.479806 -2.392724  0.385935
four   0.602425 -0.150350 -0.072265

In [42]: np.abs(frame)
Out[42]:
              b         d         e
one    0.341012  0.864333  0.228682
two    0.838441  0.483385  1.747057
three  0.479806  2.392724  0.385935
four   0.602425  0.150350  0.072265

In [43]: np.sum(frame)
Out[43]:
b    0.584803
d   -1.195355
e    2.289409
dtype: float64

Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this:

In [44]: f=lambda x:x.max() - x.min()

In [45]: frame.apply(f)
Out[45]:
b    1.440866
d    3.257057
e    1.819322
dtype: float64

In [46]: frame.apply(f,axis=1)
Out[46]:
one      0.635651
two      2.585498
three    2.872530
four     0.752775
dtype: float64

Many of the most common array statistics (such as sum and mean) are implemented as DataFrame methods, so apply is not needed for them.
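For instance, the max-minus-min range computed with apply can also be written directly with the built-in max and min methods. The sketch below uses freshly generated random data (seeded, so not the same numbers as the session above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['one', 'two', 'three', 'four'])

# Range of each column via apply...
spread_apply = frame.apply(lambda x: x.max() - x.min())
# ...and via the built-in reductions, with no apply call.
spread_builtin = frame.max() - frame.min()

# The same comparison across each row.
row_apply = frame.apply(lambda x: x.max() - x.min(), axis=1)
row_builtin = frame.max(axis=1) - frame.min(axis=1)

print(np.allclose(spread_apply, spread_builtin))
print(np.allclose(row_apply, row_builtin))
```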

Besides scalar values, the function passed to apply can also return a Series containing multiple values:

In [49]: def f(x):
    ...:     return Series([x.min(),x.max()],index=['min','max'])
    ...:
    ...:

In [50]: frame.apply(f)
Out[50]:
            b         d         e
min -0.838441 -2.392724 -0.072265
max  0.602425  0.864333  1.747057

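Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort by row or column index, use the sort_index method, which returns a new, sorted object: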

In [51]: obj=Series(range(4),index=list('dabc'))

In [52]: obj.sort_index()
Out[52]:
a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by the index on either axis:

In [54]: frame
Out[54]:
              b         d         e
one    0.341012  0.864333  0.228682
two   -0.838441  0.483385  1.747057
three  0.479806 -2.392724  0.385935
four   0.602425 -0.150350 -0.072265

In [55]: frame.sort_index()
Out[55]:
              b         d         e
four   0.602425 -0.150350 -0.072265
one    0.341012  0.864333  0.228682
three  0.479806 -2.392724  0.385935
two   -0.838441  0.483385  1.747057

In [56]: frame.sort_index(axis=1)
Out[56]:
              b         d         e
one    0.341012  0.864333  0.228682
two   -0.838441  0.483385  1.747057
three  0.479806 -2.392724  0.385935
four   0.602425 -0.150350 -0.072265

Data is sorted in ascending order by default, but it can be sorted in descending order too:

In [57]: frame.sort_index(axis=1,ascending=False)
Out[57]:
              e         d         b
one    0.228682  0.864333  0.341012
two    1.747057  0.483385 -0.838441
three  0.385935 -2.392724  0.479806
four  -0.072265 -0.150350  0.602425

To sort a DataFrame by the values in one or more columns, pass one or more column names to the by option of sort_values:

In [71]: frame=DataFrame({'b':[4,7,-2,3],'a':[0,1,0,1]})

In [72]: frame
Out[72]:
   b  a
0  4  0
1  7  1
2 -2  0
3  3  1

In [73]: frame.sort_values(by='a')
Out[73]:
   b  a
0  4  0
2 -2  0
1  7  1
3  3  1

In [77]: frame.sort_values(by=['a','b'])
Out[77]:
   b  a
2 -2  0
0  4  0
3  3  1
1  7  1
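When sorting by several columns, the ascending option can also take one flag per column. A minimal sketch on the same small frame:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -2, 3], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each 'a' group.
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(ordered)
```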

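Ranking is closely related to sorting: it assigns ranks from 1 through the number of valid data points in an array. It resembles the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank method does this.

By default, rank breaks ties by assigning each group of equal values the mean rank: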

In [78]: obj=Series([7,-5,7,4,2,0,4])

In [79]: obj.rank()
Out[79]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which values appear in the data:

In [80]: obj.rank(method='first')
Out[80]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

You can rank in descending order, too:

In [81]: obj.rank(method='max',ascending=False)
Out[81]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

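Tie-breaking methods available through the method option of rank:

average: the default; assigns the average rank to each entry in a group of equal values

min: uses the minimum rank for the whole group

max: uses the maximum rank for the whole group

first: assigns ranks in the order the values appear in the data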
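Placing the four methods side by side on the same Series makes the tie-breaking rules easier to compare. The DataFrame built here is just a convenient way to display them together:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
methods = ['average', 'min', 'max', 'first']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

The two 7s occupy rank positions 6 and 7, so they receive 6.5, 6, 7, or 6 and 7 respectively, depending on the method.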

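Axis Indexes with Duplicate Values

While many pandas functions (such as reindex) require that labels be unique, uniqueness is not mandatory: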

In [85]: obj=Series(range(5),index=list('aabbc'))

In [86]: obj
Out[86]:
a    0
a    1
b    2
b    3
c    4
dtype: int64

The index's is_unique property tells you whether its values are unique:

In [87]: obj.index.is_unique
Out[87]: False

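Data selection behaves differently when the index has duplicates. Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar: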

In [88]: obj['a']
Out[88]:
a    0
a    1
dtype: int64

In [89]: obj['c']
Out[89]: 4

The same logic applies when indexing the rows of a DataFrame:

In [94]: df=DataFrame(np.arange(12).reshape(4,3),index=list('aabb'))

In [95]: df.ix['b']
Out[95]:
   0   1   2
b  6   7   8
b  9  10  11

In [96]: df
Out[96]:
   0   1   2
a  0   1   2
a  3   4   5
b  6   7   8
b  9  10  11
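Because the return type depends on whether a label is repeated, code that consumes the result may want to check first. The sketch below uses loc (in place of ix, assuming a recent pandas) and Index.duplicated, which is not covered in these notes, to find the repeated labels:

```python
import pandas as pd

obj = pd.Series(range(5), index=list('aabbc'))

repeated = obj.loc['a']   # label appears twice: a Series comes back
single = obj.loc['c']     # label appears once: a scalar comes back

# duplicated() marks every occurrence of a label after its first one.
dupes = obj.index.duplicated()

print(type(repeated).__name__, single)
print(obj.index.is_unique, dupes.tolist())
```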