Pandas Study Notes

import numpy as np    # needed for the np.arange calls below
import pandas as pd

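Two data types: Series and DataFrame

pandas is an extension library built on NumPy that provides the tools needed to work efficiently with large datasets.

A Series consists of a set of data values and the index labels associated with them.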

In [4]: d=pd.Series(range(5)) # automatic index
In [5]: d
Out[5]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [6]: d=pd.Series(range(5),index=['a','b','c','d','e'])    # custom index
In [7]: d
Out[7]: 
a    0
b    1
c    2
d    3
e    4
dtype: int64

Passing a dict directly (the keys become the index):

In [10]: d=pd.Series({'a':1,'b':2,'c':3})
In [11]: d
Out[11]: 
a    1
b    2
c    3
dtype: int64

Or creating one from an ndarray:

In [16]: d=pd.Series(np.arange(5),index=np.arange(9,4,-1))
In [17]: d
Out[17]: 
9    0
8    1
7    2
6    3
5    4
dtype: int64

.index returns the index, and .values returns the data values.
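A minimal sketch of the two attributes, reusing the custom-index Series from above. The variable names `s`, `labels`, and `data` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

labels = s.index      # an Index object holding the labels
data = s.values       # a NumPy ndarray holding the data

print(list(labels))   # ['a', 'b', 'c', 'd', 'e']
print(data.tolist())  # [0, 1, 2, 3, 4]
print(s['c'])         # look up a value by label
print(s.iloc[2])      # look up the same value by position
```

Both lookups return the same element here, because label 'c' sits at position 2.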

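A DataFrame consists of a set of columns that share the same index. It is a tabular data type with both a row index and a column index, and it is commonly used to represent two-dimensional data.

1. Creating from a two-dimensional ndarray, or from a dict of one-dimensional objects (Series here):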

In [40]: d=pd.DataFrame(np.arange(20).reshape(4,5))

In [41]: d
Out[41]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

In [43]: dt={'one':pd.Series([1,2,3],index=['a','b','c']),
    ...:     'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}

In [45]: d=pd.DataFrame(dt)

In [46]: d
Out[46]: 
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

In [47]: pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
Out[47]: 
   two three
b    8   NaN
c    7   NaN
d    6   NaN
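The NaN values above come from index alignment: the row index of the result is the union of the Series indexes, and any label a column lacks is filled with NaN. A small sketch that checks this, using the same dict as above:

```python
import pandas as pd

dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}
d = pd.DataFrame(dt)

print(list(d.index))              # union of both indexes
print(d['one'].isna().tolist())   # only the row for 'd' is missing
print(d.dtypes)                   # 'one' was promoted to float to hold NaN
```

This is why column one prints as 1.0, 2.0, 3.0: an integer column cannot hold NaN, so it becomes float64.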

2. Creating from a dict of lists:

In [50]: dt={'one':[1,2,3,4],'two':[9,8,7,6]}

In [51]: d=pd.DataFrame(dt,index=['a','b','c','d'])

In [52]: d
Out[52]: 
   one  two
a    1    9
b    2    8
c    3    7
d    4    6
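Once the DataFrame exists, its row index, column index, and data can be inspected separately. A short sketch on the same frame (the selections shown are just examples):

```python
import pandas as pd

dt = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}
d = pd.DataFrame(dt, index=['a', 'b', 'c', 'd'])

print(list(d.index))         # row labels
print(list(d.columns))       # column labels
print(d.values.shape)        # the data as a 2-D ndarray: (4, 2)
print(d['two'].tolist())     # one column, selected by column label
print(d.loc['b'].tolist())   # one row, selected by row label
```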

Reindexing with .reindex(), which returns a new object arranged by the given labels:

In [2]: d1={'name':['Alice','Bob','Tony'],
   ...:     'gender':['f','m','m'],
   ...:     'age':[18,20,25]}

In [5]: d=pd.DataFrame(d1,index=['c1','c2','c3'])
In [6]: d
Out[6]: 
    age gender   name
c1   18      f  Alice
c2   20      m    Bob
c3   25      m   Tony

In [7]: d=d.reindex(['c3','c2','c1'])
In [8]: d
Out[8]: 
    age gender   name
c3   25      m   Tony
c2   20      m    Bob
c1   18      f  Alice

In [9]: d=d.reindex(columns=['name','gender','age'])
In [10]: d
Out[10]: 
     name gender  age
c3   Tony      m   25
c2    Bob      m   20
c1  Alice      f   18
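.reindex() can also be given labels that do not exist yet. A minimal sketch, assuming a hypothetical label 'c4' that is not in the original data:

```python
import pandas as pd

d = pd.DataFrame({'name': ['Alice', 'Bob', 'Tony'],
                  'age': [18, 20, 25]},
                 index=['c1', 'c2', 'c3'])

# 'c4' is a made-up label; its row is filled with NaN
r = d.reindex(['c3', 'c4', 'c1'])
print(r)

# fill_value replaces NaN as the filler for missing labels
ages = d['age'].reindex(['c3', 'c4', 'c1'], fill_value=0)
print(ages.tolist())   # [25, 0, 18]

print(list(d.index))   # the original frame is unchanged
```

Without fill_value the age column turns into floats, since NaN has to be stored in it.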

Common methods of the Index type:

In [11]: new1=d.columns.insert(3,'birthday')

In [12]: new1
Out[12]: Index([u'name', u'gender', u'age', u'birthday'], dtype='object')

In [17]: newd=d.reindex(columns=new1,fill_value='0101')

In [18]: newd
Out[18]: 
     name gender  age birthday
c3   Tony      m   25     0101
c2    Bob      m   20     0101
c1  Alice      f   18     0101

In [29]: newd.drop('c1')       # drop vs. delete: drop removes by label, Index.delete by position
Out[29]: 
    name gender  age birthday
c3  Tony      m   25     0101
c2   Bob      m   20     0101

In [32]: n=newd.index.delete(2)
In [33]: newd=newd.reindex(index=n)

In [34]: newd
Out[34]: 
    name gender  age birthday
c3  Tony      m   25     0101
c2   Bob      m   20     0101
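The two approaches side by side, as a rough sketch on a smaller frame (the names `by_label`, `kept`, and `by_position` are illustrative):

```python
import pandas as pd

d = pd.DataFrame({'name': ['Tony', 'Bob', 'Alice'], 'age': [25, 20, 18]},
                 index=['c3', 'c2', 'c1'])

# DataFrame.drop works on labels and returns a new DataFrame
by_label = d.drop('c1')
no_age = d.drop(columns='age')

# Index.delete works on positions and returns a new Index,
# which then has to be applied with reindex
kept = d.index.delete(2)
by_position = d.reindex(index=kept)

print(by_label)
print(by_position)
print(list(no_age.columns))
```

Neither call modifies d itself; each result has to be assigned to a name to be kept.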

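.sort_index(axis=0, ascending=True) sorts by index labels, in ascending order by default.

.sort_values() sorts by values instead; on a DataFrame, the by argument names the column(s) to sort on.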
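A short sketch of both sorting methods on a small frame with shuffled row labels. The frame `b` here is built just for this example.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'b', 'a', 'd'])

rows_sorted = b.sort_index()                        # rows ordered a, b, c, d
cols_desc = b.sort_index(axis=1, ascending=False)   # columns ordered 4, 3, 2, 1, 0
by_col = b.sort_values(by=2, ascending=False)       # rows ordered by column 2, largest first

print(rows_sorted)
print(cols_desc)
print(by_col)
```

Both methods return a new DataFrame and leave b in its original order.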

Basic statistical analysis functions:

In [7]: b
Out[7]: 
    0   1   2   3   4
c   0   1   2   3   4
b   5   6   7   8   9
a  10  11  12  13  14
d  15  16  17  18  19

In [8]: b.describe()
Out[8]: 
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

In [9]: type(b.describe())
Out[9]: pandas.core.frame.DataFrame

In [10]: b.describe().loc['max']    # .ix has been removed from pandas; use .loc for label lookup
Out[10]: 
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

In [11]: b.describe()[2]
Out[11]: 
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
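The individual statistics that describe() collects can also be computed one at a time. The notes do not show how b was built, so the construction below is an assumption chosen to match the printed frame.

```python
import numpy as np
import pandas as pd

# Assumed construction of b, matching the values and labels shown above
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'b', 'a', 'd'])

summary = b.describe()
print(summary.loc['max'])   # one statistic across all columns
print(summary[2])           # all statistics for column 2

print(b.mean())             # per-column mean, same as summary.loc['mean']
print(b.sum(axis=1))        # axis=1 computes per-row instead
print(b[0].std())           # sample standard deviation (ddof=1)
```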

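Correlation between data:

  Covariance:

    For two variables X and Y:

    if their covariance > 0, X and Y are positively correlated;

    if their covariance < 0, X and Y are negatively correlated;

    if their covariance = 0, X and Y are linearly uncorrelated (which does not by itself mean they are independent).

    .cov()

  Pearson correlation coefficient:

    .corr()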
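A minimal sketch of both methods on made-up data: y rises with x and z falls as x rises, so the signs come out positive and negative respectively. All three Series are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical data
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])    # increases with x
z = pd.Series([10, 8, 6, 4, 2])    # decreases as x increases

print(x.cov(y))    # positive covariance
print(x.cov(z))    # negative covariance
print(x.corr(y))   # Pearson coefficient close to 1
print(x.corr(z))   # Pearson coefficient close to -1

# On a DataFrame, .corr() returns the full pairwise matrix
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
print(df.corr())
```

Unlike covariance, the Pearson coefficient is scaled to the range [-1, 1], so its magnitude can be compared across pairs of variables.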