Pandas Study Notes

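1. Data Structures

Pandas has three main data structures:

Series (one-dimensional, size-immutable)
DataFrame (two-dimensional, size-mutable)
Panel (three-dimensional, size-mutable)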

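Series

A one-dimensional array-like structure holding homogeneous data, for example the collection 1, 3, 5, 7, ...

...

Key points

Homogeneous data
Size-immutable
Values are mutable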

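DataFrame

A two-dimensional structure holding heterogeneous data. For example:

Name	Age	Gender
Xiao Ming	20	Male
Xiao Hong	15	Female
Xiao Gang	18	Male

Key points

Heterogeneous data
Size-mutable
Data-mutable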

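Panel

A three-dimensional structure holding heterogeneous data; it can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so it is only available in older releases.

Key points

Heterogeneous data
Size-mutable
Data-mutable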

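2. Series

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, and so on).

Constructor

pandas.Series(data, index, dtype, copy)

Parameter	Description
data	The data, in one of several forms: ndarray, list, or a constant (scalar)
index	Index labels; they must be hashable and the same length as the data (duplicate labels are allowed). If no index is passed, a default of 0 to n-1 is used.
dtype	The data type. If omitted, the type is inferred from the data.
copy	Whether to copy the input data; defaults to False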
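
As a small illustration of the parameters in the table, the sketch below passes index, dtype, and copy explicitly. The variable names (arr, as_float, independent) are made up for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

# index supplies the labels; dtype overrides the inferred integer dtype.
as_float = pd.Series(arr, index=['x', 'y', 'z'], dtype='float64')
print(as_float)

# copy=True gives the Series its own buffer, so later changes to arr do not leak in.
independent = pd.Series(arr, copy=True)
arr[0] = 99
print(independent.iloc[0])
```

Whether a Series built without copy=True shares memory with the source array depends on the pandas version, so passing copy=True is the safer choice when the source will be modified afterwards.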

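Creating an empty Series

import pandas as pd
# Pass dtype explicitly: relying on the default dtype of an empty Series
# is deprecated in pandas 1.x, and the default is object in pandas 2.x.
s = pd.Series(dtype='float64')
print(s)

Output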

Series([], dtype: float64)

If the data is an ndarray, any index passed must have the same length as the data. If no index is passed, the default index is 0 to n-1.

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Output

0    a
1    b
2    c
3    d
dtype: object

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
# Pass custom index labels of the same length as the data
s = pd.Series(data,index=[100,101,102,103])
print(s)

Output

100    a
101    b
102    c
103    d
dtype: object

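Creating a Series from a dict: if no index is specified, the dictionary keys are used as the index. If an index is specified, the values are pulled out by label in that order, and labels missing from the dict get NaN.

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Output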

a    0.0
b    1.0
c    2.0
dtype: float64

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# 'd' is not a key in the dict, so its value is NaN
s = pd.Series(data,index=['b','c','d','a'])
print(s)

Output

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Creating a Series from a scalar: if the data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Output

0    5
1    5
2    5
3    5
dtype: int64

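Accessing data by position: data in a Series can be accessed much like data in an ndarray.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s)
# Use .iloc for positional access; s[0] on a non-integer index is deprecated in pandas 2.x.
print(s.iloc[0])

Output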

a    1
b    2
c    3
d    4
e    5
dtype: int64
1

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# First three elements
print(s[:3])

Output

a    1
b    2
c    3
dtype: int64

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# Last three elements
print(s[-3:])

Output

c    3
d    4
e    5
dtype: int64

Retrieving data by label: values can be read and set through index labels.

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s['a'])

Output

1

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# A list of labels returns a Series with those entries
print(s[['a','c','d']])

Output

a    1
c    3
d    4
dtype: int64

If a label is not present in the index, a KeyError is raised.
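
The behavior for a missing label can be sketched as follows. The label 'z' and the fallback value -1 are arbitrary choices for this example; Series.get and the in operator are two common ways to avoid the exception.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Direct lookup of a missing label raises KeyError.
try:
    s['z']
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)

# Series.get returns a default (None unless specified) instead of raising.
fallback = s.get('z', -1)
print(fallback)

# The `in` operator tests membership in the index labels, not in the values.
has_a = 'a' in s
has_z = 'z' in s
print(has_a, has_z)
```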

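3. DataFrame

pandas.DataFrame(data, index, columns, dtype, copy)

Constructor parameters:

Parameter	Description
data	The data, in one of several forms: ndarray, Series, map, list, dict, constant, or another DataFrame.
index	Row labels
columns	Column labels
dtype	Data type to apply to the columns
copy	Whether to copy the input data; defaults to False

Creating an empty DataFrame

import pandas as pd
df = pd.DataFrame()
print(df)

Output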

Empty DataFrame
Columns: []
Index: []

Creating a DataFrame from a list

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Output

   0
0  1
1  2
2  3
3  4
4  5

import pandas as pd
# Each inner list becomes one row
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Output

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

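import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
# dtype=float on the constructor would also be applied to the string column 'Name',
# which fails in recent pandas versions, so only 'Age' is converted here.
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)

Output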

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0

Creating a DataFrame from a dict of ndarrays/lists: all the arrays must have the same length. If an index is passed, its length must equal the length of the arrays; otherwise a default index is used.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Output

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Creating a DataFrame with an explicit index

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Output

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

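Creating a DataFrame from a list of dicts: a list of dictionaries can be passed as input data, and the dictionary keys become the column names by default. Keys missing from a dictionary produce NaN in that row.

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output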

   a   b     c
0  1   2   NaN
1  5  10  20.0

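Creating a DataFrame from a list of dicts with explicit row and column indexes

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
# Both column names exist as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# 'b1' is not a dictionary key, so that column is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Output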

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN

A dict of Series can be passed to build a DataFrame; the resulting index is the union of all the Series indexes.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

Output

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Column selection

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Selecting a single column returns a Series
print(df['one'])

Output

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Column addition

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Adding a new column by passing as Series:")
# The new Series is aligned on the index; row 'd' has no value and becomes NaN
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)

Output

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN

Column deletion

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']
print(df)
print("Deleting another column using POP function:")
# pop removes the column in place and returns it as a Series
df.pop('two')
print(df)

Output

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN

Row selection, addition, and deletion

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
print('---------')
print(df.loc['a'])    # select a row by label
print('---------')
print(df.iloc[2])     # select a row by integer position
print('---------')
print(df[2:4])        # slice rows by position
print('---------')
df2=pd.DataFrame([[5,6],[7,8]],index=['e','f'],columns=['one','two'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df=pd.concat([df,df2])
print(df)
df=df.drop('a')       # delete a row by label
print('---------')
print(df)

Output

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
---------
one    1.0
two    1.0
Name: a, dtype: float64
---------
one    3.0
two    3.0
Name: c, dtype: float64
---------
   one  two
c  3.0    3
d  NaN    4
---------
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    6
f  7.0    8
---------
   one  two
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    6
f  7.0    8

4. Panel

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

Parameter	Description
data	The data, in one of several forms: ndarray, Series, map, list, dict, constant, or a DataFrame
items	axis=0
major_axis	axis=1
minor_axis	axis=2
dtype	Data type of each column
copy	Whether to copy the data

Creating a Panel and selecting data

# Requires pandas < 0.25, where Panel still exists.
import pandas as pd
import numpy as np
print('--------creat an empty panel---------')
p=pd.Panel()
print(p)
print('-------------end---------------------')
print('---creat an panel from 3D ndarray----')
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print(p)
print('-------------end---------------------')
print('-creat an panel from dict(DataFrame)-')
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
print('-------------end---------------------')
print('-------select data from panel--------')
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])
print('-------------end---------------------')
print('-----select data use major_axis------')
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.major_xs(1))
print('-------------end---------------------')
print('-----select data use minor_axis------')
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))
print('-------------end---------------------')

Output

--------creat an empty panel---------
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
-------------end---------------------
---creat an panel from 3D ndarray----
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
-------------end---------------------
-creat an panel from dict(DataFrame)-
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
-------------end---------------------
-------select data from panel--------
          0         1         2
0 -0.960065 -1.114559 -0.296025
1 -0.382277 -0.585262  1.503437
2  1.315953 -0.350967 -0.711729
3  0.959712  0.800819 -0.673261
-------------end---------------------
-----select data use major_axis------
      Item1     Item2
0 -1.742578 -0.697723
1 -0.156266  0.003577
2  0.023405       NaN
-------------end---------------------
-----select data use minor_axis------
      Item1     Item2
0  1.103015  0.488929
1 -0.391214 -0.030208
2  1.783799  0.039654
3 -1.863803 -0.949056
-------------end---------------------
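
On pandas versions where Panel is no longer available, similar three-dimensional data is usually held in a DataFrame with a MultiIndex. The sketch below is one possible translation of the selections above, not a drop-in replacement; the names frames, stacked, item1, major1, and minor1, and the level names item and major, are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = {
    'Item1': pd.DataFrame(rng.standard_normal((4, 3))),
    'Item2': pd.DataFrame(rng.standard_normal((4, 2))),
}

# Stack the frames; the dict keys become the outer level of a MultiIndex.
# Item2 has no column 2, so that column is NaN for its rows.
stacked = pd.concat(frames, names=['item', 'major'])
print(stacked)

# Rough equivalent of p['Item1']: select one item.
item1 = stacked.loc['Item1']

# Rough equivalent of p.major_xs(1): the row labelled 1 from every item
# (one row per item here, i.e. transposed relative to Panel.major_xs).
major1 = stacked.xs(1, level='major')

# Rough equivalent of p.minor_xs(1): column 1 from every item, one column per item.
minor1 = stacked[1].unstack(level='item')
print(minor1)
```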

5. Basic Functionality

Series basic functionality

Attribute or method	Description
axes	Returns a list of the row axis labels.
dtype	Returns the data type of the object.
empty	Returns True if the Series is empty.
ndim	Returns the number of dimensions of the underlying data; 1 by definition.
size	Returns the number of elements in the underlying data.
values	Returns the Series as an ndarray.
head(n)	Returns the first n rows.
tail(n)	Returns the last n rows.

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print(s)
print('-------------')
print("The axes are:")
print(s.axes)
print('-------------')
print ("Is the Object empty?")
print(s.empty)
print('-------------')
print("The dimensions of the object:")
print(s.ndim)
print('-------------')
print("The size of the object:")
print(s.size)
print('-------------')
print("The actual data series is:")
print(s.values)
print('-------------')
print("The first two rows of the data series:")
print(s.head(2))
print('-------------')
print("The last two rows of the data series:")
print(s.tail(2))

Output

0   -1.478084
1    0.468882
2    0.394107
3    0.682990
dtype: float64
-------------
The axes are:
[RangeIndex(start=0, stop=4, step=1)]
-------------
Is the Object empty?
False
-------------
The dimensions of the object:
1
-------------
The size of the object:
4
-------------
The actual data series is:
[-1.47808355  0.46888222  0.3941075   0.68299036]
-------------
The first two rows of the data series:
0   -1.478084
1    0.468882
dtype: float64
-------------
The last two rows of the data series:
2    0.394107
3    0.682990
dtype: float64

DataFrame basic functionality

Attribute or method	Description
T	Transposes rows and columns.
axes	Returns a list whose only members are the row axis labels and the column axis labels.
dtypes	Returns the data types in this object.
empty	Returns True if the DataFrame is empty.
ndim	Number of axes / array dimensions.
shape	Returns a tuple representing the dimensions of the DataFrame.
size	Number of elements in the DataFrame.
values	Returns the data as an ndarray.
head(n)	Returns the first n rows.
tail(n)	Returns the last n rows.

import pandas as pd
import numpy as np
print('---------creat a DataFrame----------')
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print("Our data series is:")
print(df)
print('----------------end-----------------')
print('--the transpose of the data series--')
print(df.T)
print('----------------end-----------------')
print('-----row and column axis labels-----')
print(df.axes)
print('----------------end-----------------')
print('---the data types of each column----')
print(df.dtypes)
print('----------------end-----------------')
print('---------is the object empty--------')
print(df.empty)
print('----------------end-----------------')
print('-----------the dimension------------')
print(df.ndim)
print('----------------end-----------------')
print('--------------the shape-------------')
print(df.shape)
print('----------------end-----------------')
print('--------------the shape-------------')
print(df.shape)
print('----------------end-----------------')
print('------total number of elements------')
print(df.size)
print('----------------end-----------------')
print('-------------actual data------------')
print(df.values)
print('----------------end-----------------')
print('-------first two rows of data-------')
print(df.head(2))
print('----------------end-----------------')
print('--------last two rows of data-------')
print(df.tail(2))
print('----------------end-----------------')

Output

---------creat a DataFrame----------
Our data series is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Minsu   29    4.60
6   Jack   23    3.80
----------------end-----------------
--the transpose of the data series--
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Minsu  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
----------------end-----------------
-----row and column axis labels-----
[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]
----------------end-----------------
---the data types of each column----
Name       object
Age         int64
Rating    float64
dtype: object
----------------end-----------------
---------is the object empty--------
False
----------------end-----------------
-----------the dimension------------
2
----------------end-----------------
--------------the shape-------------
(7, 3)
----------------end-----------------
--------------the shape-------------
(7, 3)
----------------end-----------------
------total number of elements------
21
----------------end-----------------
-------------actual data------------
[['Tom' 25 4.23]
 ['James' 26 3.24]
 ['Ricky' 25 3.98]
 ['Vin' 23 2.56]
 ['Steve' 30 3.2]
 ['Minsu' 29 4.6]
 ['Jack' 23 3.8]]
----------------end-----------------
-------first two rows of data-------
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
----------------end-----------------
--------last two rows of data-------
    Name  Age  Rating
5  Minsu   29     4.6
6   Jack   23     3.8
----------------end-----------------

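6. Descriptive Statistics

Function	Description
sum()	Returns the sum of the values along the requested axis; axis=0 by default
mean()	Returns the mean
std()	Returns the standard deviation
median()	Median of the values
mode()	Mode (most frequent value) of the values
min()	Minimum
max()	Maximum
abs()	Absolute value
prod()	Product of the values
cumsum()	Cumulative sum
cumprod()	Cumulative product
describe()	Computes a summary of statistics; the include argument selects the columns: object summarizes string columns, number summarizes numeric columns, all summarizes every column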
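
A few of the functions in the table that the notes do not demonstrate can be sketched as below, on a small made-up frame. Here exclude='number' is used to summarize the non-numeric columns, which is an assumption made for portability across pandas versions; include='object' is the spelling the table refers to.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin'],
    'Age': [25, 26, 25, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56],
})

# Most frequent value(s); mode() returns a Series because ties are possible.
age_mode = df['Age'].mode()
print(age_mode)

# Running total down a numeric column.
age_cumsum = df['Age'].cumsum()
print(age_cumsum)

# Summary of the non-numeric columns only (count, unique, top, freq).
text_summary = df.describe(exclude='number')
print(text_summary)

# include='all' summarizes every column; statistics that do not apply are NaN.
all_summary = df.describe(include='all')
print(all_summary)
```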

import pandas as pd
import numpy as np
print('--------creat a DataFrame---------')
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
df = pd.DataFrame(d)
print(df)
print('---------------end----------------')
print('---------------sum----------------')
# Column-wise sum; the strings in 'Name' are concatenated
print(df.sum())
print('---------------end----------------')
# numeric_only=True skips the string column 'Name'; recent pandas versions
# raise a TypeError instead of silently dropping it.
print(df.sum(axis=1, numeric_only=True))
print('---------------end----------------')
print('--------------mean----------------')
print(df.mean(numeric_only=True))
print('---------------end----------------')
print('--------------std----------------')
print(df.std(numeric_only=True))
print('---------------end----------------')
print('------------describe--------------')
print(df.describe())
print('---------------end----------------')

Output

--------creat a DataFrame---------
      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Minsu   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65
---------------end----------------
---------------sum----------------
Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object
---------------end----------------
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
---------------end----------------
--------------mean----------------
Age       31.833333
Rating     3.743333
dtype: float64
---------------end----------------
--------------std----------------
Age       9.232682
Rating    0.661628
dtype: float64
---------------end----------------
------------describe--------------
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
---------------end----------------

7. Function Application

Table-wise function application: pipe()
Row- or column-wise function application: apply()
Element-wise function application: applymap()

pipe() performs a custom operation on the whole table: it takes a function plus the extra arguments that function needs.

import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
print('---------------end----------------')
# Table-wise: calls adder(df, 2)
print(df.pipe(adder,2))
print('---------------end----------------')
# Column-wise (default axis=0)
print(df.apply(np.mean))
print('---------------end----------------')
# Row-wise
print(df.apply(np.mean,axis=1))
print('---------------end----------------')
print(df.apply(lambda x:x.max()-x.min()))
print('---------------end----------------')
# Element-wise on a single column (Series.map)
print(df['col1'].map(lambda x:x*100))
print('---------------end----------------')
# Element-wise on the whole DataFrame; applymap is deprecated since
# pandas 2.1 in favor of DataFrame.map
print(df.applymap(lambda x:x*100))
print('---------------end----------------')

Output

       col1      col2      col3
0  1.689749  0.959856  1.074871
1 -0.392017  0.001075  0.806392
2 -0.484529  0.635483  0.644830
3 -0.049649  0.113976 -0.220698
4  1.413197 -0.576231 -0.075871
---------------end----------------
       col1      col2      col3
0  3.689749  2.959856  3.074871
1  1.607983  2.001075  2.806392
2  1.515471  2.635483  2.644830
3  1.950351  2.113976  1.779302
4  3.413197  1.423769  1.924129
---------------end----------------
col1    0.435350
col2    0.226832
col3    0.445905
dtype: float64
---------------end----------------
0    1.241492
1    0.138483
2    0.265261
3   -0.052123
4    0.253698
dtype: float64
---------------end----------------
col1    2.174278
col2    1.536088
col3    1.295569
dtype: float64
---------------end----------------
0    168.974915
1    -39.201732
2    -48.452922
3     -4.964864
4    141.319700
Name: col1, dtype: float64
---------------end----------------
         col1       col2        col3
0  168.974915  95.985614  107.487138
1  -39.201732   0.107497   80.639193
2  -48.452922  63.548250   64.483009
3   -4.964864  11.397646  -22.069797
4  141.319700 -57.623138   -7.587075
---------------end----------------

8. Reindexing

Reindexing changes the row and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis. It can:

Reorder the existing data to match a new set of labels
Insert missing-value (NA) markers at label positions where no data existed for the label

import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
# Column 'B' does not exist in df, so it is filled with NaN
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
print(df_reindexed)

Output

           A       C   B
0 2016-01-01    High NaN
2 2016-01-03  Medium NaN
5 2016-01-06  Medium NaN

Reindexing to align with another object

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
# df1 takes on the row and column labels of df2, so only rows 0 to 6 remain
df1 = df1.reindex_like(df2)
print(df1)

Output

       col1      col2      col3
0  0.533272  1.462343  1.958989
1  0.822496  1.020661 -0.958452
2  0.583271  1.100357  0.405649
3 -0.617700 -0.444208  0.921092
4 -0.883714 -0.068178  1.507545
5 -0.696816  0.729113 -0.509259
6 -0.127911 -0.255686 -1.378398

Filling while reindexing

pad/ffill - fill values forward
bfill/backfill - fill values backward
nearest - fill from the nearest index value
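
The three fill methods can be compared side by side on a small Series. This is a minimal sketch with made-up values and index labels; the method argument requires the original index to be monotonic.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 5, 10])
new_index = [0, 2, 4, 6, 8, 10]

# ffill/pad: carry the last valid observation forward.
forward = s.reindex(new_index, method='ffill')
# bfill/backfill: use the next valid observation.
backward = s.reindex(new_index, method='bfill')
# nearest: use the observation whose label is closest.
nearest = s.reindex(new_index, method='nearest')

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'nearest': nearest}))
```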

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Without a fill method, the rows missing from df2 become NaN
print(df2.reindex_like(df1))
print("Data Frame with Forward Fill:")
print(df2.reindex_like(df1,method='ffill'))

Output

       col1      col2      col3
0  0.518742  0.162080  1.606103
1 -0.355712  2.200266  1.072651
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
       col1      col2      col3
0  0.518742  0.162080  1.606103
1 -0.355712  2.200266  1.072651
2 -0.355712  2.200266  1.072651
3 -0.355712  2.200266  1.072651
4 -0.355712  2.200266  1.072651
5 -0.355712  2.200266  1.072651

Limits on filling while reindexing: the limit argument gives extra control over filling by capping the number of consecutive values that are filled.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
print(df2.reindex_like(df1))
print("Data Frame with Forward Fill limiting to 1:")
# Only one row after the last valid row is filled
print(df2.reindex_like(df1,method='ffill',limit=1))

Output

       col1      col2      col3
0  0.550406  0.220336 -0.733154
1  0.372353  0.978386  1.202727
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0  0.550406  0.220336 -0.733154
1  0.372353  0.978386  1.202727
2  0.372353  0.978386  1.202727
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN

Renaming

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print(df1)
print("After renaming the rows and columns:")
# Labels not listed in the mappings are left unchanged
print(df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

Output

       col1      col2      col3
0  0.162944 -0.257846 -0.890368
1 -0.969776  1.685473 -1.330109
2 -1.271563 -0.375700  0.778564
3 -1.123660  0.849679  0.436355
4  0.321475  0.779693 -2.100270
5 -1.184636 -0.206975  0.941504
After renaming the rows and columns:
              c1        c2      col3
apple   0.162944 -0.257846 -0.890368
banana -0.969776  1.685473 -1.330109
durian -1.271563 -0.375700  0.778564
3      -1.123660  0.849679  0.436355
4       0.321475  0.779693 -2.100270
5      -1.184636 -0.206975  0.941504

9. Iteration

Iterating over a DataFrame directly yields its column names.

import pandas as pd
import numpy as np
N=20
df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
    'x': np.linspace(0,stop=N-1,num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low','Medium','High'],N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
    })
for col in df:
    print(col)

Output

A
x
y
C
D

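To iterate over the contents of a DataFrame, the following methods are available:

items() - iterate over (column name, Series) pairs; called iteritems() in older versions, which was removed in pandas 2.0
iterrows() - iterate over the rows as (index, Series) pairs
itertuples() - iterate over the rows as namedtuples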
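
One practical difference between the row iterators is worth a quick sketch: iterrows() packs each row into a Series, which has a single dtype, so mixed numeric columns are upcast, while itertuples() keeps each value as it is. The column names n and x below are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# iterrows: the row is a Series, so the integer in 'n' is upcast to float.
first_label, first_row = next(df.iterrows())
print(first_row.dtype)

# itertuples: fields are accessed by attribute and 'n' is not converted to float.
first_tuple = next(df.itertuples())
print(first_tuple.n, first_tuple.x)
```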

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print('--------------items----------------')
# iteritems() was removed in pandas 2.0; items() is the replacement.
for key,value in df.items():
    print(key,value)
print('----------------end----------------')
print('-------------iterrows--------------')
for row_index,row in df.iterrows():
    print(row_index,row)
print('----------------end----------------')
print('-------------itertuples------------')
for row in df.itertuples():
    print(row)
print('----------------end----------------')

Output

--------------items----------------
col1 0   -0.453626
1   -1.555137
2    1.209289
3    0.238345
Name: col1, dtype: float64
col2 0   -0.309713
1   -0.018258
2    0.326646
3    1.584639
Name: col2, dtype: float64
col3 0   -1.746411
1    0.144020
2    0.932400
3   -0.848700
Name: col3, dtype: float64
----------------end----------------
-------------iterrows--------------
0 col1   -0.453626
col2   -0.309713
col3   -1.746411
Name: 0, dtype: float64
1 col1   -1.555137
col2   -0.018258
col3    0.144020
Name: 1, dtype: float64
2 col1    1.209289
col2    0.326646
col3    0.932400
Name: 2, dtype: float64
3 col1    0.238345
col2    1.584639
col3   -0.848700
Name: 3, dtype: float64
----------------end----------------
-------------itertuples------------
Pandas(Index=0, col1=-0.453625680715928, col2=-0.30971276978094636, col3=-1.7464111236386397)
Pandas(Index=1, col1=-1.5551365938912898, col2=-0.018257622785818713, col3=0.1440202346073698)
Pandas(Index=2, col1=1.2092886777094904, col2=0.3266461576970751, col3=0.9323998460902878)
Pandas(Index=3, col1=0.23834535595475798, col2=1.5846386089382405, col3=-0.8486996087036667)
----------------end----------------

10. Sorting

sort_values() lets you choose the sorting algorithm through its kind argument: mergesort, heapsort, or quicksort.
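
As a small sketch of the kind argument on made-up data: mergesort is a stable sort, so rows with equal keys keep their original relative order. According to the pandas documentation, kind is only applied when sorting on a single column or label.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [2, 1, 1, 1],
    'col2': [4, 3, 2, 1],
})

# Stable sort: the three rows with col1 == 1 stay in their original order.
by_col1 = df.sort_values(by='col1', kind='mergesort')
print(by_col1)

# Sorting by several columns: ties in col1 are broken by col2.
by_both = df.sort_values(by=['col1', 'col2'])
print(by_both)
```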

import pandas as pd
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)
print('-------sort by row label-------')
sorted_df=unsorted_df.sort_index()
print(sorted_df)
print('------reverse sort order-------')
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
print('-----sort by column label------')
sorted_df=unsorted_df.sort_index(axis=1)
print(sorted_df)
print('---------sort by value---------')
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)

Output

       col2      col1
1  0.295840 -0.880007
4  0.151129  1.843255
6 -0.516764  0.195839
2 -0.040592  0.582046
3  1.806547 -0.760579
5 -1.366668  0.652985
9 -1.180956  1.198587
8 -1.621409 -0.555094
0  0.403722  0.296659
7  0.520232 -0.759177
-------sort by row label-------
       col2      col1
0  0.403722  0.296659
1  0.295840 -0.880007
2 -0.040592  0.582046
3  1.806547 -0.760579
4  0.151129  1.843255
5 -1.366668  0.652985
6 -0.516764  0.195839
7  0.520232 -0.759177
8 -1.621409 -0.555094
9 -1.180956  1.198587
------reverse sort order-------
       col2      col1
9 -1.180956  1.198587
8 -1.621409 -0.555094
7  0.520232 -0.759177
6 -0.516764  0.195839
5 -1.366668  0.652985
4  0.151129  1.843255
3  1.806547 -0.760579
2 -0.040592  0.582046
1  0.295840 -0.880007
0  0.403722  0.296659
-----sort by column label------
       col1      col2
1 -0.880007  0.295840
4  1.843255  0.151129
6  0.195839 -0.516764
2  0.582046 -0.040592
3 -0.760579  1.806547
5  0.652985 -1.366668
9  1.198587 -1.180956
8 -0.555094 -1.621409
0  0.296659  0.403722
7 -0.759177  0.520232
---------sort by value---------
       col2      col1
1  0.295840 -0.880007
3  1.806547 -0.760579
7  0.520232 -0.759177
8 -1.621409 -0.555094
6 -0.516764  0.195839
0  0.403722  0.296659
2 -0.040592  0.582046
5 -1.366668  0.652985
9 -1.180956  1.198587
4  0.151129  1.843255

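11. Strings and Text Data

These methods are accessed through the .str accessor of a Series or Index.

Function	Description
lower()	Converts the strings in the Series/Index to lowercase
upper()	Converts the strings in the Series/Index to uppercase
len()	Computes the length of each string
strip()	Strips whitespace from both sides of each string in the Series/Index
split()	Splits each string on the given pattern
cat()	Concatenates the Series/Index elements with the given separator
get_dummies()	Returns a DataFrame of one-hot encoded values
contains()	Returns a boolean for each element indicating whether it contains the substring
replace(a,b)	Replaces the value a with the value b
repeat()	Repeats each element the specified number of times
count()	Returns the number of occurrences of the pattern in each element
startswith()	Returns True if the element starts with the pattern
endswith()	Returns True if the element ends with the pattern
find()	Returns the position of the first occurrence of the pattern
findall()	Returns a list of all occurrences of the pattern
swapcase()	Swaps upper and lower case
islower()	Checks whether all characters in each string are lowercase
isupper()	Checks whether all characters in each string are uppercase
isnumeric()	Checks whether all characters in each string are numeric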
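
The notes give no example for this section, so here is a minimal sketch of a few of the functions in the table. The sample strings and variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rickett', 'John', 'Alber@t'])

lowered = s.str.lower()
lengths = s.str.len()

# regex=False treats the pattern as a plain substring.
has_space = s.str.contains(' ', regex=False)

# Chain .str calls: strip the padding first, then split on spaces.
stripped = s.str.strip()
parts = stripped.str.split(' ')

# cat with only sep joins all elements into a single string.
joined = stripped.str.cat(sep='_')

# One indicator column per distinct string.
dummies = stripped.str.get_dummies()

print(lengths)
print(parts)
print(joined)
print(dummies)
```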

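12. Customizing Display Options

pd.get_option(param) # get the current value of an option
pd.set_option(param, value) # set the value of an option
pd.reset_option(param) # reset the option to its default value
pd.describe_option(param) # print the description of an option
pd.option_context(param, value) # set an option temporarily inside a with block; the previous value is restored on exit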
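
The option functions above can be exercised as in the sketch below. The option names are real pandas display options; the values 20 and 2 are arbitrary choices for this example.

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')
default_precision = pd.get_option('display.precision')

# Change an option globally, then read it back.
pd.set_option('display.max_rows', 20)
changed_rows = pd.get_option('display.max_rows')

# Restore the library default.
pd.reset_option('display.max_rows')
restored_rows = pd.get_option('display.max_rows')

# option_context applies the setting only inside the with block.
with pd.option_context('display.precision', 2):
    inside_precision = pd.get_option('display.precision')
    print(pd.Series([3.14159]))
after_precision = pd.get_option('display.precision')

print(changed_rows, restored_rows, inside_precision, after_precision)
```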

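Parameter	Description
"display.max_rows"	Maximum number of rows displayed
"display.max_columns"	Maximum number of columns displayed
"display.expand_frame_repr"	Whether the repr of a wide DataFrame is stretched across multiple lines
"display.max_colwidth"	Maximum column width displayed
"display.precision"	Number of decimal places displayed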

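13. Indexing

.loc[,] # label-based; the first argument selects rows, the second selects columns; accepts a single label, a list of labels, or a label slice
.iloc[,] # integer position-based; the first argument selects rows, the second selects columns; accepts an integer, a list of integers, or an integer slice
.ix[,] # mixed label/position indexing; deprecated in pandas 0.20 and removed in pandas 1.0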
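
A short sketch of the two indexers on a made-up frame. Note the square brackets, and that label slices include both endpoints while position slices exclude the end.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

# .loc: label-based.
one_cell = df.loc['b', 'A']
label_slice = df.loc['a':'c', ['A', 'C']]   # rows 'a', 'b', and 'c'

# .iloc: integer position-based.
pos_cell = df.iloc[1, 0]
pos_slice = df.iloc[0:2, 0:2]               # rows 0-1, columns 0-1

# .loc also accepts a boolean mask for the rows.
masked = df.loc[df['A'] > 2, 'B']

print(label_slice)
print(pos_slice)
print(masked)
```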