A Quick Introduction to pandas (Deep Learning Primer, Part 2)

Source: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

pandas (Python Data Analysis Library) is a tool built on top of NumPy, created to make data analysis tasks easier. It incorporates a large number of libraries and several standard data models, and provides the tools needed to operate efficiently on large datasets. pandas offers a wealth of functions and methods that let us process data quickly and conveniently. You will soon find that it is one of the key reasons Python is such a powerful and efficient environment for data analysis.

pandas data structures:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both also resemble Python's built-in list, with one difference: a list may hold elements of different types, whereas a NumPy array and a Series store a single data type. This uses memory more efficiently and speeds up computation.
 
Time Series: a Series indexed by time.
 
DataFrame: a two-dimensional tabular data structure, similar in many ways to R's data.frame. A DataFrame can be thought of as a container of Series. The rest of this guide focuses mainly on DataFrame.
 
Panel: a three-dimensional array, which can be thought of as a container of DataFrames. (Panel was deprecated in pandas 0.20 and removed in 1.0; a DataFrame with a MultiIndex is the usual replacement.)
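The dtype homogeneity that distinguishes a Series from a plain list can be seen directly. A minimal sketch (the variable names are just for illustration):

```python
import pandas as pd

# A Python list may freely mix element types; a Series coerces everything
# to a single dtype, falling back to 'object' when it cannot.
s_int = pd.Series([1, 2, 3])        # homogeneous ints -> int64
s_mix = pd.Series([1, 'a', 3.0])    # mixed input -> object dtype

print(s_int.dtype)  # int64
print(s_mix.dtype)  # object
```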
 

The conventional imports for pandas are:

In [69]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
 

Object creation

Creating a Series: in pandas, a Series is a labeled array. Build one from a list, and pandas will generate a default integer index automatically:

In [70]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[70]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
 

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [71]:
dates = pd.date_range('20130101', periods=6)
dates
Out[71]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [72]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[72]:
                   A         B         C         D
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
2013-01-03 0.345764 0.031672 0.660530 0.240518
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732
2013-01-05 0.119948 0.932712 -0.375521 -1.901977
2013-01-06 1.153657 1.556955 -1.172664 -0.097324
 

Creating a DataFrame by passing a dict of objects that can be converted to something Series-like:

In [73]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
Out[73]:
     A          B    C  D      E    F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
 

The columns of the resulting DataFrame have different dtypes:

In [74]:
df2.dtypes
Out[74]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
 

Viewing data

View the first or last few rows of the data:

In [75]:
df.head(2)
Out[75]:
                   A         B         C         D
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
In [76]:
df.tail(3)
Out[76]:
                   A         B         C         D
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732
2013-01-05 0.119948 0.932712 -0.375521 -1.901977
2013-01-06 1.153657 1.556955 -1.172664 -0.097324
 

Display the index, the columns, and the underlying NumPy data:

In [77]:
df.index
Out[77]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [78]:
df.columns
Out[78]:
Index(['A', 'B', 'C', 'D'], dtype='object')
In [79]:
df.values
Out[79]:
array([[-0.23452598, -0.75850137, -0.60955962, -0.54116297],
       [-1.37778166, -0.80120685,  2.16304408,  0.48161429],
       [ 0.34576422,  0.03167245,  0.66053017,  0.24051834],
       [-0.89690453, -1.61403402, -0.53950167, -1.5127315 ],
       [ 0.11994836,  0.93271156, -0.37552097, -1.90197732],
       [ 1.1536568 ,  1.55695533, -1.17266403, -0.097324  ]])
 

describe() shows a quick statistical summary of the data:

In [80]:
df.describe()
Out[80]:
              A         B         C         D
count 6.000000 6.000000 6.000000 6.000000
mean -0.148307 -0.108734 0.021055 -0.555177
std 0.904502 1.187138 1.207575 0.964228
min -1.377782 -1.614034 -1.172664 -1.901977
25% -0.731310 -0.790530 -0.592045 -1.269839
50% -0.057289 -0.363414 -0.457511 -0.319243
75% 0.289310 0.707452 0.401517 0.156058
max 1.153657 1.556955 2.163044 0.481614
 

Transposing the data:

In [81]:
df.T
Out[81]:
   2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00  2013-01-05 00:00:00  2013-01-06 00:00:00
A -0.234526 -1.377782 0.345764 -0.896905 0.119948 1.153657
B -0.758501 -0.801207 0.031672 -1.614034 0.932712 1.556955
C -0.609560 2.163044 0.660530 -0.539502 -0.375521 -1.172664
D -0.541163 0.481614 0.240518 -1.512732 -1.901977 -0.097324
 

Sorting by an axis:

In [82]:
df.sort_index(axis=1, ascending=False)
Out[82]:
                   D         C         B         A
2013-01-01 -0.541163 -0.609560 -0.758501 -0.234526
2013-01-02 0.481614 2.163044 -0.801207 -1.377782
2013-01-03 0.240518 0.660530 0.031672 0.345764
2013-01-04 -1.512732 -0.539502 -1.614034 -0.896905
2013-01-05 -1.901977 -0.375521 0.932712 0.119948
2013-01-06 -0.097324 -1.172664 1.556955 1.153657
 

Sorting by values:

In [83]:
df.sort_values(by='B')
Out[83]:
                   A         B         C         D
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163
2013-01-03 0.345764 0.031672 0.660530 0.240518
2013-01-05 0.119948 0.932712 -0.375521 -1.901977
2013-01-06 1.153657 1.556955 -1.172664 -0.097324
 

Selection

Note: while standard Python/NumPy expressions for selecting and setting data are intuitive and handy for interactive work, for production code we recommend the optimized pandas access methods: .at, .iat, .loc and .iloc (the older .ix indexer has since been deprecated).
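The division of labor among these indexers can be sketched in a few lines (the DataFrame here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['r0', 'r1', 'r2'], columns=list('ABCD'))

# .loc/.at select by label, .iloc/.iat by integer position;
# .at/.iat are the fast scalar-access variants of .loc/.iloc.
by_label    = df.loc['r1', 'B']
by_position = df.iloc[1, 1]
fast_scalar = df.at['r1', 'B']

print(by_label, by_position, fast_scalar)  # 5 5 5
```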

 

Getting data

Selecting the single column labeled 'A', which yields a Series:

In [84]:
df['A']
Out[84]:
2013-01-01   -0.234526
2013-01-02   -1.377782
2013-01-03    0.345764
2013-01-04   -0.896905
2013-01-05    0.119948
2013-01-06    1.153657
Freq: D, Name: A, dtype: float64
 

Slicing rows with []:

In [85]:
df[0:3]
Out[85]:
                   A         B         C         D
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
2013-01-03 0.345764 0.031672 0.660530 0.240518
In [86]:
df['20130102':'20130104']
Out[86]:
                   A         B         C         D
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
2013-01-03 0.345764 0.031672 0.660530 0.240518
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732
 

Selection by label

Getting a cross-section using a label:

In [87]:
df.loc[dates[0]]
Out[87]:
A   -0.234526
B   -0.758501
C   -0.609560
D   -0.541163
Name: 2013-01-01 00:00:00, dtype: float64
 

Selecting multiple columns by label:

In [88]:
df.loc[:,['A','B']]
Out[88]:
                   A         B
2013-01-01 -0.234526 -0.758501
2013-01-02 -1.377782 -0.801207
2013-01-03 0.345764 0.031672
2013-01-04 -0.896905 -1.614034
2013-01-05 0.119948 0.932712
2013-01-06 1.153657 1.556955
 

Selecting by both row and column labels:

In [89]:
df.loc['20130102':'20130104',['A','B']]
Out[89]:
                   A         B
2013-01-02 -1.377782 -0.801207
2013-01-03 0.345764 0.031672
2013-01-04 -0.896905 -1.614034
 

Reducing the dimensions of the returned object:

In [90]:
df.loc['20130102', ['A','B']]
Out[90]:
A   -1.377782
B   -0.801207
Name: 2013-01-02 00:00:00, dtype: float64
 

Getting a scalar value:

In [91]:
df.loc[dates[0], 'A']
Out[91]:
-0.23452598284858567
 

Fast access to a scalar (equivalent to the previous method):

In [92]:
df.at[dates[0], 'A']
Out[92]:
-0.23452598284858567
 

Selection by position

Selecting via the position of the passed integers:

In [93]:
df.iloc[3]
Out[93]:
A   -0.896905
B   -1.614034
C   -0.539502
D   -1.512732
Name: 2013-01-04 00:00:00, dtype: float64
 

Selecting by integer slices of rows and columns, acting similarly to Python/NumPy:

In [94]:
df.iloc[3:5, 0:2]
Out[94]:
                   A         B
2013-01-04 -0.896905 -1.614034
2013-01-05 0.119948 0.932712
 

Selecting specific rows and columns with lists of integer positions:

In [95]:
df.iloc[[0, 2, 4], [0, 2]]
Out[95]:
                   A         C
2013-01-01 -0.234526 -0.609560
2013-01-03 0.345764 0.660530
2013-01-05 0.119948 -0.375521
 

Slicing rows explicitly:

In [96]:
df.iloc[1:3,:]
Out[96]:
                   A         B         C         D
2013-01-02 -1.377782 -0.801207 2.163044 0.481614
2013-01-03 0.345764 0.031672 0.660530 0.240518
 

Slicing columns explicitly:

In [97]:
df.iloc[:,1:3]
Out[97]:
                   B         C
2013-01-01 -0.758501 -0.609560
2013-01-02 -0.801207 2.163044
2013-01-03 0.031672 0.660530
2013-01-04 -1.614034 -0.539502
2013-01-05 0.932712 -0.375521
2013-01-06 1.556955 -1.172664
 

Getting a single value by row and column position:

In [98]:
df.iloc[1, 1]
Out[98]:
-0.80120684660851138
 

Fast access to a scalar (equivalent to the previous method):

In [99]:
df.iat[1,1]
Out[99]:
-0.80120684660851138
 

Boolean (conditional) indexing

Using a single column's values to select data:

In [100]:
df[df.A > 0]
Out[100]:
                   A         B         C         D
2013-01-03 0.345764 0.031672 0.660530 0.240518
2013-01-05 0.119948 0.932712 -0.375521 -1.901977
2013-01-06 1.153657 1.556955 -1.172664 -0.097324
 

Selecting values from the whole DataFrame where a boolean condition is met:

In [101]:
df[df > 0]
Out[101]:
                   A         B         C         D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN 2.163044 0.481614
2013-01-03 0.345764 0.031672 0.660530 0.240518
2013-01-04 NaN NaN NaN NaN
2013-01-05 0.119948 0.932712 NaN NaN
2013-01-06 1.153657 1.556955 NaN NaN
 

Filtering with the isin() method:

In [102]:
df2 = df.copy()
df2['E'] = ['one', 'two', 'three', 'four', 'five', 'six']
df2
Out[102]:
                   A         B         C         D      E
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163 one
2013-01-02 -1.377782 -0.801207 2.163044 0.481614 two
2013-01-03 0.345764 0.031672 0.660530 0.240518 three
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732 four
2013-01-05 0.119948 0.932712 -0.375521 -1.901977 five
2013-01-06 1.153657 1.556955 -1.172664 -0.097324 six
In [103]:
df2[df2['E'].isin(['two','four'])]
Out[103]:
                   A         B         C         D     E
2013-01-02 -1.377782 -0.801207 2.163044 0.481614 two
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732 four
 

Setting data

 

Setting a new column automatically aligns the data by the index:

In [104]:
s1 = pd.Series([1,2,3,4,5,6], index = pd.date_range('20130102', periods = 6))
s1
Out[104]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
In [105]:
df['F'] = s1
df
Out[105]:
                   A         B         C         D    F
2013-01-01 -0.234526 -0.758501 -0.609560 -0.541163 NaN
2013-01-02 -1.377782 -0.801207 2.163044 0.481614 1.0
2013-01-03 0.345764 0.031672 0.660530 0.240518 2.0
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732 3.0
2013-01-05 0.119948 0.932712 -0.375521 -1.901977 4.0
2013-01-06 1.153657 1.556955 -1.172664 -0.097324 5.0
 

Setting values by label:

In [106]:
df.at[dates[0],'A'] = 0
df
Out[106]:
                   A         B         C         D    F
2013-01-01 0.000000 -0.758501 -0.609560 -0.541163 NaN
2013-01-02 -1.377782 -0.801207 2.163044 0.481614 1.0
2013-01-03 0.345764 0.031672 0.660530 0.240518 2.0
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732 3.0
2013-01-05 0.119948 0.932712 -0.375521 -1.901977 4.0
2013-01-06 1.153657 1.556955 -1.172664 -0.097324 5.0
 

Setting values by position:

In [107]:
df.iat[0, 1] = 0
df
Out[107]:
                   A         B         C         D    F
2013-01-01 0.000000 0.000000 -0.609560 -0.541163 NaN
2013-01-02 -1.377782 -0.801207 2.163044 0.481614 1.0
2013-01-03 0.345764 0.031672 0.660530 0.240518 2.0
2013-01-04 -0.896905 -1.614034 -0.539502 -1.512732 3.0
2013-01-05 0.119948 0.932712 -0.375521 -1.901977 4.0
2013-01-06 1.153657 1.556955 -1.172664 -0.097324 5.0
 

Setting an entire column with a NumPy array:

In [109]:
df.loc[:,'D'] = np.array([5] * len(df))
df
Out[109]:
                   A         B         C  D    F
2013-01-01 0.000000 0.000000 -0.609560 5 NaN
2013-01-02 -1.377782 -0.801207 2.163044 5 1.0
2013-01-03 0.345764 0.031672 0.660530 5 2.0
2013-01-04 -0.896905 -1.614034 -0.539502 5 3.0
2013-01-05 0.119948 0.932712 -0.375521 5 4.0
2013-01-06 1.153657 1.556955 -1.172664 5 5.0
 

Setting values with a where operation:

In [112]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[112]:
                   A         B         C  D    F
2013-01-01 0.000000 0.000000 -0.609560 -5 NaN
2013-01-02 -1.377782 -0.801207 -2.163044 -5 -1.0
2013-01-03 -0.345764 -0.031672 -0.660530 -5 -2.0
2013-01-04 -0.896905 -1.614034 -0.539502 -5 -3.0
2013-01-05 -0.119948 -0.932712 -0.375521 -5 -4.0
2013-01-06 -1.153657 -1.556955 -1.172664 -5 -5.0
 

Missing data

 

pandas primarily uses np.nan to represent missing data; missing values are excluded from computations by default.
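This default can be seen with a tiny Series (a minimal sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Aggregations skip NaN by default...
print(s.sum())              # 4.0
print(s.mean())             # 2.0 -- NaN is excluded from the count as well
# ...unless skipna=False is requested explicitly:
print(s.sum(skipna=False))  # nan
```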

 

Reindexing allows you to change/add/delete the index on a specified axis; it returns a copy of the data:

In [113]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
Out[113]:
                   A         B         C  D    F    E
2013-01-01 0.000000 0.000000 -0.609560 5 NaN 1.0
2013-01-02 -1.377782 -0.801207 2.163044 5 1.0 1.0
2013-01-03 0.345764 0.031672 0.660530 5 2.0 NaN
2013-01-04 -0.896905 -1.614034 -0.539502 5 3.0 NaN
 

Dropping any rows that contain missing data:

In [114]:
df1.dropna(how='any')
Out[114]:
                   A         B         C  D    F    E
2013-01-02 -1.377782 -0.801207 2.163044 5 1.0 1.0
 

Filling in missing data:

In [116]:
df1.fillna(value=5)
Out[116]:
                   A         B         C  D    F    E
2013-01-01 0.000000 0.000000 -0.609560 5 5.0 1.0
2013-01-02 -1.377782 -0.801207 2.163044 5 1.0 1.0
2013-01-03 0.345764 0.031672 0.660530 5 2.0 5.0
2013-01-04 -0.896905 -1.614034 -0.539502 5 3.0 5.0
 

Getting a boolean mask: True where a value is NaN, False otherwise:

In [117]:
pd.isnull(df1)
Out[117]:
                A      B      C      D      F      E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
 

Operations

 

Statistics

 

These operations generally exclude missing data.

 

Column-wise mean:

In [118]:
df.mean()
Out[118]:
A   -0.109219
B    0.017683
C    0.021055
D    5.000000
F    3.000000
dtype: float64
 

Row-wise mean (the same operation along the other axis):

In [119]:
df.mean(1)
Out[119]:
2013-01-01    1.097610
2013-01-02    1.196811
2013-01-03    1.607593
2013-01-04    0.989912
2013-01-05    1.935428
2013-01-06    2.307590
Freq: D, dtype: float64
 

Operating with objects that have different dimensionality and need alignment; pandas automatically broadcasts along the specified dimension:

In [120]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
Out[120]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
In [121]:
df.sub(s, axis='index')
Out[121]:
                   A         B         C    D    F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.654236 -0.968328 -0.339470 4.0 1.0
2013-01-04 -3.896905 -4.614034 -3.539502 2.0 0.0
2013-01-05 -4.880052 -4.067288 -5.375521 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
 

Apply

 

Applying functions to the data:

In [122]:
df.apply(np.cumsum)
Out[122]:
                   A         B         C   D     F
2013-01-01 0.000000 0.000000 -0.609560 5 NaN
2013-01-02 -1.377782 -0.801207 1.553484 10 1.0
2013-01-03 -1.032017 -0.769534 2.214015 15 3.0
2013-01-04 -1.928922 -2.383568 1.674513 20 6.0
2013-01-05 -1.808974 -1.450857 1.298992 25 10.0
2013-01-06 -0.655317 0.106098 0.126328 30 15.0
In [123]:
df.apply(lambda x: x.max() - x.min())
Out[123]:
A    2.531438
B    3.170989
C    3.335708
D    0.000000
F    4.000000
dtype: float64
 

Histogramming (value counts):

In [125]:
s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[125]:
0    5
1    4
2    0
3    6
4    1
5    0
6    4
7    1
8    1
9    4
dtype: int32
In [126]:
s.value_counts()
Out[126]:
4    3
1    3
0    2
6    1
5    1
dtype: int64
 

String methods

 

Series is equipped with a set of string-processing methods (via the str attribute) that make it easy to operate on each element of the array:

In [128]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s
Out[128]:
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
In [129]:
s.str.lower()
Out[129]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
 

Combining data

 

Concat

 

pandas provides a variety of facilities for combining Series, DataFrame, and Panel objects:

In [133]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[133]:
          0         1         2         3
0 -0.382355 -1.457945 -1.058879 1.106354
1 1.120769 -0.559993 -2.319053 1.404413
2 -0.382069 1.362170 0.289857 1.009210
3 -1.280824 1.158049 -0.356128 0.023975
4 0.797376 0.384624 0.152443 -1.392915
5 -1.886364 -0.276635 -1.159430 2.355427
6 -1.851654 0.781897 1.054916 0.323811
7 0.708013 1.648491 -0.031482 -1.245125
8 0.502539 -1.013665 -1.653738 0.748612
9 -0.773031 -1.085371 -1.070529 -0.190790
In [135]:
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Out[135]:
          0         1         2         3
0 -0.382355 -1.457945 -1.058879 1.106354
1 1.120769 -0.559993 -2.319053 1.404413
2 -0.382069 1.362170 0.289857 1.009210
3 -1.280824 1.158049 -0.356128 0.023975
4 0.797376 0.384624 0.152443 -1.392915
5 -1.886364 -0.276635 -1.159430 2.355427
6 -1.851654 0.781897 1.054916 0.323811
7 0.708013 1.648491 -0.031482 -1.245125
8 0.502539 -1.013665 -1.653738 0.748612
9 -0.773031 -1.085371 -1.070529 -0.190790
 

Join

In [137]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
Out[137]:
   key  lval
0 foo 1
1 foo 2
In [138]:
right
Out[138]:
   key  rval
0 foo 4
1 foo 5
In [139]:
pd.merge(left, right, on='key')
Out[139]:
   key  lval  rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
 

Another example:

In [141]:
left= pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left
Out[141]:
   key  lval
0 foo 1
1 bar 2
In [142]:
right
Out[142]:
   key  rval
0 foo 4
1 bar 5
In [143]:
pd.merge(left, right, on='key')
Out[143]:
   key  lval  rval
0 foo 1 4
1 bar 2 5
 

Append

In [145]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
Out[145]:
          A         B         C         D
0 -0.789698 0.807736 -0.535839 1.633274
1 1.122605 0.550968 -0.337497 -0.287827
2 0.231271 0.273175 0.891461 -0.531991
3 0.902182 0.802540 -0.319774 0.131826
4 -1.999558 -0.114146 0.449653 0.174982
5 1.688851 0.789424 -0.194151 2.017002
6 0.738474 -0.682955 -1.420737 -0.978726
7 -0.234849 0.682447 0.753980 0.327666
In [146]:
s = df.iloc[3]
df.append(s, ignore_index=True)
Out[146]:
          A         B         C         D
0 -0.789698 0.807736 -0.535839 1.633274
1 1.122605 0.550968 -0.337497 -0.287827
2 0.231271 0.273175 0.891461 -0.531991
3 0.902182 0.802540 -0.319774 0.131826
4 -1.999558 -0.114146 0.449653 0.174982
5 1.688851 0.789424 -0.194151 2.017002
6 0.738474 -0.682955 -1.420737 -0.978726
7 -0.234849 0.682447 0.753980 0.327666
8 0.902182 0.802540 -0.319774 0.131826
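Note that DataFrame.append was later deprecated and removed in pandas 2.0; the same row-append can be written with pd.concat. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
row = df.iloc[1]

# pd.concat replaces the removed DataFrame.append for adding a row:
# turn the Series back into a one-row frame, then concatenate.
appended = pd.concat([df, row.to_frame().T], ignore_index=True)
print(appended.shape)  # (3, 2)
```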
 

Grouping

 

By "group by" we mean a process involving one or more of the following steps: 1. splitting the data into groups based on some criteria;

In [147]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
Out[147]:
     A      B         C         D
0 foo one 0.645354 0.110824
1 bar one 1.090199 -1.103622
2 foo two 0.492488 0.918727
3 bar three -1.938242 0.122475
4 foo two -0.416253 -1.056679
5 bar two -0.616998 0.482990
6 foo one 0.311077 0.641402
7 foo three -1.400911 -0.126283
 

2. applying a function to each group independently, e.g. sum:

In [148]:
df.groupby('A').sum()
Out[148]:
            C         D
A
bar -1.465042 -0.498157
foo -0.368245 0.487990
 

3. combining the results into a data structure:

In [149]:
df.groupby(['A','B']).sum()
Out[149]:
                  C         D
A   B
bar one    1.090199 -1.103622
    three -1.938242  0.122475
    two   -0.616998  0.482990
foo one    0.956431  0.752225
    three -1.400911 -0.126283
    two    0.076235 -0.137952
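The same split-apply-combine flow also supports several aggregations at once via agg(). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Split on 'A', apply both sum and mean to each group, combine the results
# into a frame with one column per aggregation.
result = df.groupby('A')['C'].agg(['sum', 'mean'])
print(result)
```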
 

Reshaping

 

Stack

In [151]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
Out[151]:
                     A         B
first second
bar   one     0.482226 -1.085646
      two     0.805270  1.785375
baz   one    -0.860945 -1.018662
      two     0.921586 -0.216946
 

The stack() method "compresses" a level in the DataFrame's columns:

In [152]:
stacked = df2.stack()
stacked
Out[152]:
first  second   
bar    one     A    0.482226
               B   -1.085646
       two     A    0.805270
               B    1.785375
baz    one     A   -0.860945
               B   -1.018662
       two     A    0.921586
               B   -0.216946
dtype: float64
 

unstack() is the inverse of stack(); by default it unstacks the last level:

In [153]:
stacked.unstack()
Out[153]:
                     A         B
first second
bar   one     0.482226 -1.085646
      two     0.805270  1.785375
baz   one    -0.860945 -1.018662
      two     0.921586 -0.216946
In [154]:
stacked.unstack(1)
Out[154]:
second        one       two
first
bar   A  0.482226  0.805270
      B -1.085646  1.785375
baz   A -0.860945  0.921586
      B -1.018662 -0.216946
In [155]:
stacked.unstack(0)
Out[155]:
first         bar       baz
second
one    A  0.482226 -0.860945
       B -1.085646 -1.018662
two    A  0.805270  0.921586
       B  1.785375 -0.216946
 

Pivot tables

In [156]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
Out[156]:
        A  B    C         D         E
0 one A foo -1.863191 -0.420588
1 one B foo 1.308556 1.373186
2 two C foo 0.969745 -0.244350
3 three A bar 0.308731 0.904415
4 one B bar 1.918195 -0.113832
5 one C bar 0.121828 -0.364853
6 two A foo 0.240318 -0.536804
7 three B foo -1.067721 0.743653
8 one C foo -0.837986 0.775367
9 one A bar -0.622907 1.653944
10 two B bar 0.221569 -2.451701
11 three C bar -2.312243 0.502830
 

We can produce a pivot table from this data very easily:

In [157]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[157]:
C                bar       foo
A     B
one   A    -0.622907 -1.863191
      B     1.918195  1.308556
      C     0.121828 -0.837986
three A     0.308731       NaN
      B          NaN -1.067721
      C    -2.312243       NaN
two   A          NaN  0.240318
      B     0.221569       NaN
      C          NaN  0.969745
 

Time series

 

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion:

In [158]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
Out[158]:
2012-01-01    26325
Freq: 5T, dtype: int32
 

Time zone representation:

In [159]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
Out[159]:
2012-03-06   -0.756171
2012-03-07   -1.818210
2012-03-08   -1.742229
2012-03-09   -0.666278
2012-03-10   -0.246013
Freq: D, dtype: float64
In [160]:
ts_utc = ts.tz_localize('UTC')
ts_utc
Out[160]:
2012-03-06 00:00:00+00:00   -0.756171
2012-03-07 00:00:00+00:00   -1.818210
2012-03-08 00:00:00+00:00   -1.742229
2012-03-09 00:00:00+00:00   -0.666278
2012-03-10 00:00:00+00:00   -0.246013
Freq: D, dtype: float64
 

Converting to another time zone:

In [161]:
ts_utc.tz_convert('US/Eastern')
Out[161]:
2012-03-05 19:00:00-05:00   -0.756171
2012-03-06 19:00:00-05:00   -1.818210
2012-03-07 19:00:00-05:00   -1.742229
2012-03-08 19:00:00-05:00   -0.666278
2012-03-09 19:00:00-05:00   -0.246013
Freq: D, dtype: float64
 

Converting between time span representations:

In [162]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
Out[162]:
2012-01-31   -2.280217
2012-02-29   -1.265666
2012-03-31   -0.693234
2012-04-30   -0.160583
2012-05-31    0.394237
Freq: M, dtype: float64
In [163]:
ps = ts.to_period()
ps
Out[163]:
2012-01   -2.280217
2012-02   -1.265666
2012-03   -0.693234
2012-04   -0.160583
2012-05    0.394237
Freq: M, dtype: float64
In [164]:
ps.to_timestamp()
Out[164]:
2012-01-01   -2.280217
2012-02-01   -1.265666
2012-03-01   -0.693234
2012-04-01   -0.160583
2012-05-01    0.394237
Freq: MS, dtype: float64
 

Converting between period and timestamp allows some convenient arithmetic functions to be used. In the following example, a quarterly frequency with year ending in November is converted to 9am of the end of the month following each quarter's end:

In [165]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
Out[165]:
1990-03-01 09:00   -0.116439
1990-06-01 09:00    0.689428
1990-09-01 09:00    1.603478
1990-12-01 09:00   -0.074146
1991-03-01 09:00   -0.642818
Freq: H, dtype: float64
 

Categorical Data

 

Since version 0.15, pandas has included categorical data in DataFrames:

In [166]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
In [167]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[167]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
 

Renaming the categories to more meaningful names:

In [168]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
In [169]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
Out[169]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
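In newer pandas releases, assigning directly to .cat.categories (as done above) is no longer allowed; Series.cat.rename_categories achieves the same renaming. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'a']).astype('category')

# rename_categories returns a new Series with the category labels replaced
# in order: 'a' -> 'good', 'b' -> 'bad'. The data itself is unchanged.
s2 = s.cat.rename_categories(['good', 'bad'])
print(list(s2))  # ['good', 'bad', 'bad', 'good']
```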
 

Sorting is per order in the categories, not lexical order:

In [170]:
df.sort_values(by="grade")
Out[170]:
   id raw_grade      grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
 

Grouping by a categorical column also shows empty categories:

In [171]:
df.groupby("grade").size()
Out[171]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
 

Plotting

In [174]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x15444d8a518>
In [176]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
Out[176]:
<matplotlib.legend.Legend at 0x154451d2048>
 

Getting data in and out

 

CSV

In [177]:
df.to_csv('foo.csv')
In [178]:
pd.read_csv('foo.csv')
Out[178]:
    Unnamed: 0         A          B          C          D
0 2000-01-01 -1.393111 1.514788 0.280721 -0.294551
1 2000-01-02 -0.642525 2.195259 0.592064 -0.057531
2 2000-01-03 -1.001919 1.979959 2.552074 1.107618
3 2000-01-04 0.949372 1.834230 2.081279 1.344826
4 2000-01-05 1.031375 2.840040 1.999606 2.436098
5 2000-01-06 0.709656 3.388670 1.952937 1.355730
6 2000-01-07 1.723083 3.503932 0.290802 1.950817
7 2000-01-08 2.374127 5.203101 -0.330689 3.287472
8 2000-01-09 2.674692 8.018739 -0.036404 3.521545
9 2000-01-10 2.630701 8.189795 0.385022 4.931897
10 2000-01-11 3.126444 7.579697 0.679853 5.769419
11 2000-01-12 2.709425 8.329440 -1.069793 5.963032
12 2000-01-13 2.860447 7.253739 -1.309274 5.421635
13 2000-01-14 3.314667 8.721476 -1.965677 4.064228
14 2000-01-15 2.349331 8.863615 -2.761195 4.930499
15 2000-01-16 1.958310 7.119070 -2.625528 2.593476
16 2000-01-17 2.070067 9.004844 -2.052658 2.760979
17 2000-01-18 1.965295 8.512214 -2.359618 0.699228
18 2000-01-19 0.900034 8.133227 -1.982716 1.624482
19 2000-01-20 0.035752 6.445181 -2.893294 1.594081
20 2000-01-21 -2.014831 6.620557 -4.773599 1.002027
21 2000-01-22 -0.923967 8.693741 -3.695795 0.558935
22 2000-01-23 -1.668225 7.790933 -3.573539 0.506860
23 2000-01-24 -2.056015 6.824887 -2.564427 -0.662262
24 2000-01-25 -1.843934 8.955545 -2.683112 -0.809258
25 2000-01-26 -1.856735 8.607138 -2.440229 -2.196696
26 2000-01-27 -3.463581 8.966359 -3.873806 -1.524501
27 2000-01-28 -4.788617 7.752132 -4.553026 -2.399321
28 2000-01-29 -5.976016 7.036011 -4.460207 -1.605071
29 2000-01-30 -5.034437 6.138244 -4.451524 -1.628326
... ... ... ... ... ...
970 2002-08-28 7.087637 -10.915543 -58.482667 -74.821127
971 2002-08-29 6.120060 -10.997083 -57.452650 -73.742471
972 2002-08-30 7.691062 -11.176013 -57.091866 -73.281751
973 2002-08-31 8.452480 -11.293687 -56.313169 -74.869274
974 2002-09-01 7.553114 -10.510296 -56.900910 -75.732503
975 2002-09-02 8.275944 -9.958087 -56.768495 -75.445624
976 2002-09-03 10.713038 -10.430143 -56.750073 -76.824079
977 2002-09-04 12.299215 -11.939397 -58.104176 -77.024297
978 2002-09-05 13.160027 -12.323100 -58.055834 -79.144961
979 2002-09-06 14.413937 -11.905493 -57.145252 -78.776055
980 2002-09-07 14.514871 -11.271294 -58.854309 -78.735910
981 2002-09-08 14.828057 -10.749718 -59.836208 -77.881782
982 2002-09-09 14.525107 -9.981579 -59.973495 -77.019424
983 2002-09-10 13.248493 -10.430729 -60.947579 -76.936804
984 2002-09-11 13.866722 -9.581368 -61.176041 -76.366701
985 2002-09-12 13.711199 -9.163406 -60.199849 -76.076018
986 2002-09-13 12.278210 -9.744253 -61.488643 -77.830263
987 2002-09-14 13.872331 -9.769701 -60.596225 -79.301200
988 2002-09-15 13.742700 -11.035063 -61.132231 -79.930390
989 2002-09-16 13.117977 -10.444062 -61.367571 -79.468517
990 2002-09-17 12.547621 -13.403313 -61.542260 -77.685127
991 2002-09-18 12.156448 -14.764817 -62.347422 -77.060493
992 2002-09-19 12.356847 -14.440572 -62.154593 -75.846297
993 2002-09-20 12.967553 -13.593848 -62.368733 -76.248747
994 2002-09-21 12.188960 -13.960158 -64.801744 -77.560436
995 2002-09-22 10.930343 -14.180484 -61.880778 -78.704844
996 2002-09-23 11.546164 -14.640839 -60.011920 -79.081217
997 2002-09-24 12.010131 -13.948313 -58.988687 -79.707070
998 2002-09-25 10.203590 -14.170849 -59.165740 -78.678488
999 2002-09-26 9.985623 -13.509717 -57.876236 -78.605853

1000 rows × 5 columns
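The Unnamed: 0 column above is the old index that to_csv wrote out; passing index_col=0 to read_csv restores it as the index. A minimal sketch using an in-memory CSV (the column names here are made up):

```python
import io
import pandas as pd

csv_text = "date,A\n2000-01-01,1.5\n2000-01-02,2.5\n"

# index_col=0 turns the first CSV column back into the index instead of
# leaving it behind as an 'Unnamed: 0'-style data column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df.columns))  # ['A']
print(list(df.index))    # ['2000-01-01', '2000-01-02']
```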

 

HDF5

In [179]:
df.to_hdf('foo.h5','df')
In [180]:
pd.read_hdf('foo.h5','df')
Out[180]:
                    A          B          C          D
2000-01-01 -1.393111 1.514788 0.280721 -0.294551
2000-01-02 -0.642525 2.195259 0.592064 -0.057531
2000-01-03 -1.001919 1.979959 2.552074 1.107618
2000-01-04 0.949372 1.834230 2.081279 1.344826
2000-01-05 1.031375 2.840040 1.999606 2.436098
2000-01-06 0.709656 3.388670 1.952937 1.355730
2000-01-07 1.723083 3.503932 0.290802 1.950817
2000-01-08 2.374127 5.203101 -0.330689 3.287472
2000-01-09 2.674692 8.018739 -0.036404 3.521545
2000-01-10 2.630701 8.189795 0.385022 4.931897
2000-01-11 3.126444 7.579697 0.679853 5.769419
2000-01-12 2.709425 8.329440 -1.069793 5.963032
2000-01-13 2.860447 7.253739 -1.309274 5.421635
2000-01-14 3.314667 8.721476 -1.965677 4.064228
2000-01-15 2.349331 8.863615 -2.761195 4.930499
2000-01-16 1.958310 7.119070 -2.625528 2.593476
2000-01-17 2.070067 9.004844 -2.052658 2.760979
2000-01-18 1.965295 8.512214 -2.359618 0.699228
2000-01-19 0.900034 8.133227 -1.982716 1.624482
2000-01-20 0.035752 6.445181 -2.893294 1.594081
2000-01-21 -2.014831 6.620557 -4.773599 1.002027
2000-01-22 -0.923967 8.693741 -3.695795 0.558935
2000-01-23 -1.668225 7.790933 -3.573539 0.506860
2000-01-24 -2.056015 6.824887 -2.564427 -0.662262
2000-01-25 -1.843934 8.955545 -2.683112 -0.809258
2000-01-26 -1.856735 8.607138 -2.440229 -2.196696
2000-01-27 -3.463581 8.966359 -3.873806 -1.524501
2000-01-28 -4.788617 7.752132 -4.553026 -2.399321
2000-01-29 -5.976016 7.036011 -4.460207 -1.605071
2000-01-30 -5.034437 6.138244 -4.451524 -1.628326
... ... ... ... ...
2002-08-28 7.087637 -10.915543 -58.482667 -74.821127
2002-08-29 6.120060 -10.997083 -57.452650 -73.742471
2002-08-30 7.691062 -11.176013 -57.091866 -73.281751
2002-08-31 8.452480 -11.293687 -56.313169 -74.869274
2002-09-01 7.553114 -10.510296 -56.900910 -75.732503
2002-09-02 8.275944 -9.958087 -56.768495 -75.445624
2002-09-03 10.713038 -10.430143 -56.750073 -76.824079
2002-09-04 12.299215 -11.939397 -58.104176 -77.024297
2002-09-05 13.160027 -12.323100 -58.055834 -79.144961
2002-09-06 14.413937 -11.905493 -57.145252 -78.776055
2002-09-07 14.514871 -11.271294 -58.854309 -78.735910
2002-09-08 14.828057 -10.749718 -59.836208 -77.881782
2002-09-09 14.525107 -9.981579 -59.973495 -77.019424
2002-09-10 13.248493 -10.430729 -60.947579 -76.936804
2002-09-11 13.866722 -9.581368 -61.176041 -76.366701
2002-09-12 13.711199 -9.163406 -60.199849 -76.076018
2002-09-13 12.278210 -9.744253 -61.488643 -77.830263
2002-09-14 13.872331 -9.769701 -60.596225 -79.301200
2002-09-15 13.742700 -11.035063 -61.132231 -79.930390
2002-09-16 13.117977 -10.444062 -61.367571 -79.468517
2002-09-17 12.547621 -13.403313 -61.542260 -77.685127
2002-09-18 12.156448 -14.764817 -62.347422 -77.060493
2002-09-19 12.356847 -14.440572 -62.154593 -75.846297
2002-09-20 12.967553 -13.593848 -62.368733 -76.248747
2002-09-21 12.188960 -13.960158 -64.801744 -77.560436
2002-09-22 10.930343 -14.180484 -61.880778 -78.704844
2002-09-23 11.546164 -14.640839 -60.011920 -79.081217
2002-09-24 12.010131 -13.948313 -58.988687 -79.707070
2002-09-25 10.203590 -14.170849 -59.165740 -78.678488
2002-09-26 9.985623 -13.509717 -57.876236 -78.605853

1000 rows × 4 columns

 

Excel

In [181]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
In [182]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[182]:
                    A          B          C          D
2000-01-01 -1.393111 1.514788 0.280721 -0.294551
2000-01-02 -0.642525 2.195259 0.592064 -0.057531
2000-01-03 -1.001919 1.979959 2.552074 1.107618
2000-01-04 0.949372 1.834230 2.081279 1.344826
2000-01-05 1.031375 2.840040 1.999606 2.436098
2000-01-06 0.709656 3.388670 1.952937 1.355730
2000-01-07 1.723083 3.503932 0.290802 1.950817
2000-01-08 2.374127 5.203101 -0.330689 3.287472
2000-01-09 2.674692 8.018739 -0.036404 3.521545
2000-01-10 2.630701 8.189795 0.385022 4.931897
2000-01-11 3.126444 7.579697 0.679853 5.769419
2000-01-12 2.709425 8.329440 -1.069793 5.963032
2000-01-13 2.860447 7.253739 -1.309274 5.421635
2000-01-14 3.314667 8.721476 -1.965677 4.064228
2000-01-15 2.349331 8.863615 -2.761195 4.930499
2000-01-16 1.958310 7.119070 -2.625528 2.593476
2000-01-17 2.070067 9.004844 -2.052658 2.760979
2000-01-18 1.965295 8.512214 -2.359618 0.699228
2000-01-19 0.900034 8.133227 -1.982716 1.624482
2000-01-20 0.035752 6.445181 -2.893294 1.594081
2000-01-21 -2.014831 6.620557 -4.773599 1.002027
2000-01-22 -0.923967 8.693741 -3.695795 0.558935
2000-01-23 -1.668225 7.790933 -3.573539 0.506860
2000-01-24 -2.056015 6.824887 -2.564427 -0.662262
2000-01-25 -1.843934 8.955545 -2.683112 -0.809258
2000-01-26 -1.856735 8.607138 -2.440229 -2.196696
2000-01-27 -3.463581 8.966359 -3.873806 -1.524501
2000-01-28 -4.788617 7.752132 -4.553026 -2.399321
2000-01-29 -5.976016 7.036011 -4.460207 -1.605071
2000-01-30 -5.034437 6.138244 -4.451524 -1.628326
... ... ... ... ...
2002-08-28 7.087637 -10.915543 -58.482667 -74.821127
2002-08-29 6.120060 -10.997083 -57.452650 -73.742471
2002-08-30 7.691062 -11.176013 -57.091866 -73.281751
2002-08-31 8.452480 -11.293687 -56.313169 -74.869274
2002-09-01 7.553114 -10.510296 -56.900910 -75.732503
2002-09-02 8.275944 -9.958087 -56.768495 -75.445624
2002-09-03 10.713038 -10.430143 -56.750073 -76.824079
2002-09-04 12.299215 -11.939397 -58.104176 -77.024297
2002-09-05 13.160027 -12.323100 -58.055834 -79.144961
2002-09-06 14.413937 -11.905493 -57.145252 -78.776055
2002-09-07 14.514871 -11.271294 -58.854309 -78.735910
2002-09-08 14.828057 -10.749718 -59.836208 -77.881782
2002-09-09 14.525107 -9.981579 -59.973495 -77.019424
2002-09-10 13.248493 -10.430729 -60.947579 -76.936804
2002-09-11 13.866722 -9.581368 -61.176041 -76.366701
2002-09-12 13.711199 -9.163406 -60.199849 -76.076018
2002-09-13 12.278210 -9.744253 -61.488643 -77.830263
2002-09-14 13.872331 -9.769701 -60.596225 -79.301200
2002-09-15 13.742700 -11.035063 -61.132231 -79.930390
2002-09-16 13.117977 -10.444062 -61.367571 -79.468517
2002-09-17 12.547621 -13.403313 -61.542260 -77.685127
2002-09-18 12.156448 -14.764817 -62.347422 -77.060493
2002-09-19 12.356847 -14.440572 -62.154593 -75.846297
2002-09-20 12.967553 -13.593848 -62.368733 -76.248747
2002-09-21 12.188960 -13.960158 -64.801744 -77.560436
2002-09-22 10.930343 -14.180484 -61.880778 -78.704844
2002-09-23 11.546164 -14.640839 -60.011920 -79.081217
2002-09-24 12.010131 -13.948313 -58.988687 -79.707070
2002-09-25 10.203590 -14.170849 -59.165740 -78.678488
2002-09-26 9.985623 -13.509717 -57.876236 -78.605853

1000 rows × 4 columns

 
Original article: https://www.cnblogs.com/hgl0417/p/7625928.html