pandas Data Processing Guide

First, get familiar with NumPy's functions for generating random n-dimensional arrays (only the commonly used ones are listed):

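import numpy as np

np.random.random([3, 4])     # random array of shape [3, 4], values in [0.0, 1.0)
np.random.rand(3,4,5)         # random array of shape [3, 4, 5], values in [0.0, 1.0)
np.random.randn(3,4)          # random array of shape [3, 4], samples drawn from the standard normal distribution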
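
To check the shapes and value ranges described above, here is a minimal sketch. The seed value and variable names are arbitrary choices for illustration. The last lines use `np.random.default_rng`, NumPy's newer Generator interface, as an alternative to the legacy functions listed here.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so repeated runs give the same numbers

a = np.random.random([3, 4])   # shape passed as a single sequence
b = np.random.rand(3, 4, 5)    # shape passed as separate integer arguments
c = np.random.randn(3, 4)      # standard normal samples, shape as separate arguments

print(a.shape, b.shape, c.shape)
print(a.min() >= 0.0 and a.max() < 1.0)   # uniform samples stay inside [0.0, 1.0)

# Newer Generator interface: one object, shape always passed as a tuple
rng = np.random.default_rng(0)
d = rng.random((3, 4))            # uniform on [0.0, 1.0)
e = rng.standard_normal((3, 4))   # standard normal
print(d.shape, e.shape)
```

Note that `random` takes the shape as one argument while `rand` and `randn` take each dimension as its own argument; mixing the two calling styles is a common source of errors.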

The two typical pandas data structures and how to create them:

Series

In [4]: s = pd.Series([1,3,5,np.nan,6,8])

In [5]: s
Out[5]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
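
The output above shows dtype float64 even though most inputs were integers: `np.nan` is a float, so the whole Series is stored as floats. The sketch below rebuilds the same Series and shows a few common ways to handle the missing value; the fill value 0 is only an illustrative choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)                 # float64, because NaN is a float
print(s.isna().sum())          # 1 missing value
print(s.dropna().tolist())     # [1.0, 3.0, 5.0, 6.0, 8.0]
print(s.fillna(0).tolist())    # [1.0, 3.0, 5.0, 0.0, 6.0, 8.0]
```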

DataFrame

Creating from a multidimensional np.random array

In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
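
Once a DataFrame like this exists, it helps to inspect its shape, index, and columns before working with it. The following is a small sketch of that inspection; because the values are random, the printed numbers will differ from the session above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df.shape)                    # (rows, columns)
print(df.index[0], df.index[-1])   # first and last dates in the index
print(list(df.columns))            # column labels
print(df.head(2))                  # first two rows
print(df.describe())               # per-column summary statistics

arr = df.to_numpy()                # the underlying values as a NumPy array
print(arr.shape)
```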

Creating from a dictionary

In [10]: df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : pd.Categorical(["test","train","test","train"]),
                         'F' : 'foo' })

In [11]: df2
Out[11]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
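
In the dictionary form, each key becomes a column, scalar values such as `1.` and `'foo'` are repeated to match the length of the longest entry, and each column keeps its own dtype. A short sketch that rebuilds `df2` and prints the per-column dtypes follows; the exact dtype names shown for the timestamp and string columns can vary between pandas versions.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.,                                                     # scalar, repeated for every row
    'B': pd.Timestamp('20130102'),                               # scalar timestamp, repeated
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),    # sets the row index 0..3
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',                                                  # scalar string, repeated
})

print(df2.dtypes)    # one dtype per column
print(df2.shape)     # (4, 6)
```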

Selecting data in pandas

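df['A']     # a single column label selects that column
df[0:2]     # a slice selects a range of rows by position; note the difference from the line above
df.loc['20130102':'20130104', ['A', 'B']]    # select by label (both ends of the slice are included)
df.iloc[3:5, [0, 2]]     # select by integer position (the end of the slice is excluded)
df[df['A'] > 0]   # boolean mask: keep the rows where column 'A' is positive; the inner df['A'] can be any selection built from the forms above
df.drop_duplicates(['pop', 'state'])   # deduplicate: drop rows that repeat an earlier row in both 'pop' and 'state'
df = data[data['A'] != 2]    # drop every row whose value in column 'A' equals 2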
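
The selection forms above can be tried end to end with the sketch below. It uses deterministic values from `np.arange` instead of random numbers so the results are predictable. The `people` and `data` frames, their column values, and the threshold 8 are hypothetical examples invented for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Column A holds 0, 4, 8, 12, 16, 20; the other columns follow on from it
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col_a = df['A']                                         # one column, returned as a Series
first_two = df[0:2]                                     # rows at positions 0 and 1
by_label = df.loc['20130102':'20130104', ['A', 'B']]    # 3 rows: label slices include the end
by_pos = df.iloc[3:5, [0, 2]]                           # 2 rows: position slices exclude the end
large_a = df[df['A'] > 8]                               # boolean mask on column A

print(by_label.shape, by_pos.shape, len(large_a))

# Hypothetical frame with repeated ('pop', 'state') pairs
people = pd.DataFrame({
    'pop': [1, 1, 2, 1],
    'state': ['OH', 'OH', 'OH', 'NV'],
    'year': [2000, 2001, 2002, 2003],
})
deduped = people.drop_duplicates(['pop', 'state'])      # keeps the first row of each pair
print(deduped.index.tolist())

# Hypothetical frame: remove the rows where A equals 2 by keeping the rest
data = pd.DataFrame({'A': [1, 2, 3, 2], 'B': list('wxyz')})
kept = data[data['A'] != 2]
print(kept['A'].tolist())
```

Filtering with a boolean mask returns a new DataFrame and leaves the original unchanged, which is why the last form assigns the result back to a variable.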