Reading Notes 6: Basic pandas Usage

I. Series: a one-dimensional labeled sequence, much like a NumPy array. It can be initialized from a list, a tuple, a dict, or a NumPy array.

>>> from pandas import Series
>>> s = Series([0.1, 1.2, 2.3, 3.4, 4.5])
>>> s
0 0.1
1 1.2
2 2.3
3 3.4
4 4.5
dtype: float64
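
The list case is shown above; the other initializers mentioned can be sketched the same way. The values and variable names here are made up for illustration.

```python
import numpy as np
from pandas import Series

from_tuple = Series((0.1, 1.2, 2.3))            # default integer index 0..2
from_dict = Series({'a': 0.1, 'b': 1.2})        # dict keys become the index labels
from_array = Series(np.array([0.1, 1.2, 2.3]))  # dtype follows the NumPy array

print(from_dict)
```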

2. The index of a Series can also be made of labels; by default it is integers starting at 0.

>>> s = Series([0.1, 1.2, 2.3, 3.4, 4.5], index = ['a','b','c','d','e'])
>>> s
a 0.1
b 1.2
c 2.3
d 3.4
e 4.5
dtype: float64

A Series can be indexed by integer position, by label, by boolean mask, or by slice.

from pandas import Series
s = Series([0.1, 1.2, 2.3, 3.4, 4.5], index = ['a','b','c','d','e'])
s.iloc[1]  # position 1; recent pandas deprecates s[1] as a positional lookup on a labeled index
Out[36]:
1.2

from pandas import Series
s = Series([0.1, 1.2, 2.3, 3.4, 4.5], index = ['a','b','c','d','e'])
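print(s.iloc[1], '\n')     # single element by position
print(s[1:4], '\n')        # integer slice, stop excluded
print(s[s > 3], '\n')      # boolean mask
print(s.iloc[[1, 2, 3]])   # list of positions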
1.2 

b    1.2
c    2.3
d    3.4
dtype: float64 

d    3.4
e    4.5
dtype: float64 

b    1.2
c    2.3
d    3.4
dtype: float64
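
A small sketch that spells out label-based and position-based access with the explicit `.loc` and `.iloc` indexers, using the same Series as above. The variable names are only for illustration.

```python
from pandas import Series

s = Series([0.1, 1.2, 2.3, 3.4, 4.5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b']          # lookup by label
by_position = s.iloc[1]        # lookup by position, same element here
label_slice = s.loc['b':'d']   # label slices include both endpoints
pos_slice = s.iloc[1:4]        # position slices exclude the stop
mask = s[s > 3]                # boolean mask keeps entries where the test is True

print(label_slice)
```

Note the difference between the two slices: the label slice stops at 'd' inclusive, while the position slice stops before position 4, so both select b, c, d.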

II. Common Series methods

1. head and tail show the first or last 5 rows; pass an argument to change how many rows are shown.

from pandas import Series
s = Series([0.1, 1.2, 2.3, 3.4, 4.5], index = ['a','b','c','d','e'])
print(s.head(), '\n')
print(s.head(2))

a    0.1
b    1.2
c    2.3
d    3.4
e    4.5
dtype: float64 

a    0.1
b    1.2
dtype: float64

2. isnull and notnull return a boolean Series of the same length, marking which entries are (or are not) missing.
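
A minimal sketch of both methods on a Series with one missing value; the data is made up.

```python
import numpy as np
from pandas import Series

s = Series([1.0, np.nan, 3.0])
missing = s.isnull()    # True where the value is NaN
present = s.notnull()   # elementwise opposite of isnull

print(missing)
print(s[present])       # the mask can be used to keep only non-missing entries
```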

3. describe returns summary statistics of the Series.

from pandas import Series
import numpy as np
s=Series(np.arange(1.0,10))
s.describe()
Out[43]:
count    9.000000
mean     5.000000
std      2.738613
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max      9.000000
dtype: float64

4. unique returns the distinct values; nunique returns the number of distinct values.
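
A short sketch with made-up data that shows the difference between the two.

```python
from pandas import Series

s = Series([1, 2, 2, 3, 3, 3])
distinct = s.unique()   # NumPy array of distinct values, in order of first appearance
count = s.nunique()     # how many distinct values there are

print(distinct, count)
```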

5. drop(labels) removes the entries with the given labels; dropna() removes the NaN entries.
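
A sketch of both calls on a small made-up Series. Both return a new Series and leave the original as it was.

```python
import numpy as np
from pandas import Series

s = Series([1.0, np.nan, 3.0, 4.0], index=['a', 'b', 'c', 'd'])
without_a = s.drop(['a'])   # remove the entry labeled 'a'
without_nan = s.dropna()    # remove the NaN entry (label 'b')

print(without_a)
print(without_nan)
```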

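6. append(series) adds data to the end of a Series. (Series.append was removed in pandas 2.0; concat does the same job there.)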

from pandas import Series, concat
import numpy as np
s = Series(np.arange(1.0, 10))
s2 = Series([22, 33, 44, 55])
# Series.append was removed in pandas 2.0; concat gives the same result
print(concat([s, s2]))

0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
0    22.0
1    33.0
2    44.0
3    55.0
dtype: float64

7. replace(to_replace, value) swaps each value listed in to_replace for the matching entry in value.

Note: replace returns the modified data as a new Series; the original is left unchanged.

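from pandas import Series, concat
import numpy as np
s = Series(np.arange(1.0, 10))
s2 = Series([22, 33, 44, 55])
s3 = concat([s, s2])  # Series.append was removed in pandas 2.0
print(s3.replace([2, 5, 8], [22, 55, 99]))
s3  # unchanged: replace returned a new Series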

0     1.0
1    22.0
2     3.0
3     4.0
4    55.0
5     6.0
6     7.0
7    99.0
8     9.0
0    22.0
1    33.0
2    44.0
3    55.0
dtype: float64
Out[51]:
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
0    22.0
1    33.0
2    44.0
3    55.0
dtype: float64

8. update(series) updates values from another Series; only entries whose labels match are changed.

Note: update modifies the original Series in place.

>>> from numpy import arange
>>> s1 = Series(arange(1.0,4.0),index=['a','b','c'])
>>> s1
a    1.0
b    2.0
c    3.0
dtype: float64
>>> s2 = Series(-1.0 * arange(1.0,4.0),index=['c','d','e'])
>>> s1.update(s2)  # only label 'c' exists in both
>>> s1
a    1.0
b    2.0
c   -1.0
dtype: float64

9. DataFrame: the counterpart of a two-dimensional array. Unlike a NumPy array, it can combine columns of different data types.

from pandas import DataFrame
import numpy as np
a = np.array([[1, 2], [3, 4]])
df = DataFrame(a)
df
Out[52]:
     0    1
0    1    2
1    3    4

>>> from numpy import array
>>> df = DataFrame(array([[1,2],[3,4]]),columns=['a','b'])
>>> df
   a  b
0  1  2
1  3  4

Row labels and column labels can both be specified.

>>> df = DataFrame(array([[1,2],[3,4]]), columns=['dogs','cats'], index=['Alice','Bob'])
>>> df
       dogs  cats
Alice     1     2
Bob       3     4

10. A DataFrame can also be initialized from a dict.
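
A minimal sketch of the dict form, reusing the dogs/cats labels from the example above. Each key becomes a column name and each value supplies that column's data.

```python
from pandas import DataFrame

# Keys become column labels; the index argument supplies the row labels
df = DataFrame({'dogs': [1, 3], 'cats': [2, 4]}, index=['Alice', 'Bob'])

print(df)
```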

11. Column labels can also be specified.

>>> df = DataFrame(array([[1,2],[3,4]]), columns=['dogs','cats'], index=['Alice','Bob'])
>>> df
       dogs  cats
Alice     1     2
Bob       3     4

III. Working with a DataFrame. The examples assume an Excel file in the working directory; mine is score.xlsx.

1. Reading data

2. Select columns with a column name or a list of column names.

3. Select rows with a row label or a list of row labels (through loc), or with an integer slice or a boolean mask.

from pandas import read_excel
score = read_excel('score.xlsx','Sheet1')
score[:1]
 

Out[20]:

	序号	english	math	chinese	physics	chemistry	biology
0	1501	56	65	89	45	87	98

from pandas import read_excel
score = read_excel('score.xlsx','Sheet1')
t=score[(score.english>60) & (score.english<70)]
t
 

Out[22]:

	序号	english	math	chinese	physics	chemistry	biology
2	1503	65	78	68	86	78	87
5	1506	64	67	82	76	78	73
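
Column selection has no example above, so here is a rough sketch. It builds a small stand-in for score.xlsx from the first three rows shown earlier (only three of the columns), so that it runs without the Excel file; the variable names are for illustration.

```python
from pandas import DataFrame

# Stand-in for the first rows of score.xlsx (subset of columns)
score = DataFrame({'english': [56, 45, 65],
                   'math': [65, 65, 78],
                   'chinese': [89, 89, 68]})

one_col = score['english']              # a single name gives a Series
two_cols = score[['english', 'math']]   # a list of names gives a DataFrame
first_rows = score[:2]                  # integer slice selects rows
passed = score[score.english > 60]      # boolean mask selects rows

print(two_cols)
```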

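4. To select rows and columns together, use loc[rowselector, colselector] for labels or iloc[rowselector, colselector] for positions. (Older pandas used ix for this; it was removed in pandas 1.0.)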
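
A sketch of combined row and column selection. The DataFrame is a small stand-in built from the first rows of score.xlsx shown earlier, so the Excel file is not needed to run it.

```python
from pandas import DataFrame

# Stand-in for the first rows of score.xlsx (subset of columns)
score = DataFrame({'english': [56, 45, 65],
                   'math': [65, 65, 78],
                   'chinese': [89, 89, 68]})

cell = score.loc[0, 'english']                              # one row label, one column label
block = score.loc[score.english > 50, ['english', 'math']]  # boolean rows, list of columns
by_pos = score.iloc[0:2, 0:2]                               # first two rows and columns by position

print(block)
```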

5. Adding a column works much like adding a key to a dict.

>>> state_gdp_2012 = state_gdp[['state','gdp_2012']]
>>> state_gdp_2012.head()
state gdp_2012
0 Alabama 157272
1 Alaska 44732
2 Arizona 230641
3 Arkansas 93892
4 California 1751002
>>> state_gdp_2012['gdp_growth_2012'] = state_gdp['gdp_growth_2012']
>>> state_gdp_2012.head()
state gdp_2012 gdp_growth_2012
0 Alabama 157272 1.2
1 Alaska 44732 1.1
2 Arizona 230641 2.6
3 Arkansas 93892 1.3
4 California 1751002 3.5

Alternatively, use insert(location, column_name, series).

>>> state_gdp_2012 = state_gdp[['state','gdp_2012']]
>>> state_gdp_2012.insert(1,'gdp_growth_2012',state_gdp['gdp_growth_2012'])
>>> state_gdp_2012.head()
state gdp_growth_2012 gdp_2012
0 Alabama 1.2 157272
1 Alaska 1.1 44732
2 Arizona 2.6 230641
3 Arkansas 1.3 93892
4 California 3.5 1751002

6. Modifying data

from pandas import read_excel
score = read_excel('score.xlsx','Sheet1')
print(score[:3])
score.loc[0, 'english'] = 90  # ix was removed in pandas 1.0; loc selects by label
print(score[:3])
     序号  english  math  chinese  physics  chemistry  biology
0  1501       56    65       89       45         87       98
1  1502       45    65       89       78         98       89
2  1503       65    78       68       86         78       87
     序号  english  math  chinese  physics  chemistry  biology
0  1501       90    65       89       45         87       98
1  1502       45    65       89       78         98       89
2  1503       65    78       68       86         78       87

7. To delete columns, use the del keyword, the pop(column) method, or drop(list of columns, axis=1).

from pandas import Series
from pandas import read_excel
score = read_excel('score.xlsx','Sheet1')
scorecopy = score.copy()
print(score[:2])
score.pop('biology')  # removes the column in place and returns it
print(score[:2])

     序号  english  math  chinese  physics  chemistry  biology
0  1501       56    65       89       45         87       98
1  1502       45    65       89       78         98       89
     序号  english  math  chinese  physics  chemistry
0  1501       56    65       89       45         87
1  1502       45    65       89       78         98
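
The pop case is shown above; del and drop can be sketched on a small stand-in for score.xlsx, built from the two rows printed above with a subset of the columns.

```python
from pandas import DataFrame

# Stand-in for the first two rows of score.xlsx (subset of columns)
score = DataFrame({'english': [56, 45],
                   'math': [65, 65],
                   'chinese': [89, 89],
                   'biology': [98, 89]})

trimmed = score.drop(['biology'], axis=1)  # returns a new DataFrame, score is untouched
del score['chinese']                       # deletes the column in place
popped = score.pop('math')                 # deletes in place and returns the column

print(trimmed.columns.tolist())
print(score.columns.tolist())
```

The difference worth remembering: drop returns a copy, while del and pop change the DataFrame itself.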

8. dropna removes the rows or columns that contain NaN; drop_duplicates removes duplicate rows.
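
A minimal sketch of both methods on a made-up DataFrame that has one NaN and one repeated row.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'one': [1.0, 1.0, np.nan, 2.0],
                'two': [3.0, 3.0, 4.0, 5.0]})

no_nan_rows = df.dropna()         # drop rows that contain any NaN
no_nan_cols = df.dropna(axis=1)   # drop columns that contain any NaN
deduped = df.drop_duplicates()    # drop repeated rows, keeping the first occurrence

print(no_nan_rows)
print(deduped)
```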

9. fillna(value=value) replaces every NaN with the given value; a dict gives a separate fill value per column.

>>> from numpy import array, nan
>>> df = DataFrame(array([[1, nan],[nan, 2]]))
>>> df.columns = ['one','two']
>>> replacements = {'one':-1, 'two':-2}  # fill value per column
>>> df.fillna(value=replacements)
   one  two
0  1.0 -2.0
1 -1.0  2.0

10. Sorting with sort_values (called sort in old pandas releases)

>>> df = DataFrame(array([[1, 3],[1, 2],[3, 2],[2,1]]), columns=['one','two'])
>>> df.sort_values(by='one')  # DataFrame.sort was removed from pandas
   one  two
0    1    3
1    1    2
3    2    1
2    3    2

>>> df.sort_values(by=['one','two'], ascending=[False, True])  # 'one' descending, then 'two' ascending
   one  two
2    3    2
3    2    1
1    1    2
0    1    3