Pandas入门学习笔记4

5 层次化索引

层次化索引是pandas的重要功能。以低维度的形式处理高维度数据。

In [185]: data = Series(np.random.randn(10),index=[list('aaabbbccdd'),[1,2,3,1,2,3,2,3,2,3]])

In [186]: data
Out[186]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64

获取索引。

In [187]: data.index
Out[187]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 1, 2, 1, 2]])

In [188]: data['b']
Out[188]:
1    2.498391
2    0.243617
3   -0.818542
dtype: float64

In [189]: data['b':'c']
Out[189]:
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
dtype: float64

In [190]: data[:,2]  # 获取内层索引
Out[190]:
a    0.077532
b    0.243617
c   -1.222213
d    1.131352
dtype: float64


In [191]: data.unstack()  # unstack来重新安排到dataframe中。
Out[191]:
          1         2         3
a  0.458553  0.077532 -1.561180
b  2.498391  0.243617 -0.818542
c       NaN -1.222213 -0.797079
d       NaN  1.131352 -1.292136

In [192]: data.unstack().stack()  # 逆运算--stack
Out[192]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64

DataFrame每条轴都可以分层索引。

5.1 重排分级顺序

可以重排调整某条轴上的索引顺序，swaplevel可以互换两个索引值，并范围一个新的对象。

In [193]: frame = DataFrame(np.random.randn(4,3),index=[list('aabb'),[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
     ...:

In [194]: frame
Out[194]:
         Ohio            Colorado
        Green       Red     Green
a 1  0.368997  0.670430  1.056365
  2 -0.352259 -0.656101  0.018544
b 1 -0.574535 -0.531988  0.295466
  2 -0.973587  0.225511 -0.250887

In [198]: frame.index.names = ['key1','key2']

In [199]: frame.columns.names = ['state','color']

In [200]: frame
Out[200]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887

In [201]: frame.swaplevel('key1','key2')
Out[201]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [202]: frame.sortlevel(1)
Out[202]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
b    1    -0.574535 -0.531988  0.295466
a    2    -0.352259 -0.656101  0.018544
b    2    -0.973587  0.225511 -0.250887

In [203]: frame.swaplevel(0,1)
Out[203]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [204]: frame.swaplevel(0,1).sortlevel(0)
Out[204]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
     b    -0.574535 -0.531988  0.295466
2    a    -0.352259 -0.656101  0.018544
     b    -0.973587  0.225511 -0.250887

5.2 根据级别汇总统计

许多DataFrame和Series汇总和统计方法都有level选项，指定在某个轴。


In [205]: frame
Out[205]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887


In [207]: frame.sum(level='key2')
Out[207]:
state      Ohio            Colorado
color     Green       Red     Green
key2                               
1     -0.205538  0.138443  1.351831
2     -1.325846 -0.430590 -0.232343

In [209]: frame.sum(level='color',axis=1)
Out[209]:
color         Green       Red
key1 key2                    
a    1     1.425362  0.670430
     2    -0.333715 -0.656101
b    1    -0.279069 -0.531988
     2    -1.224474  0.225511

5.3 使用DataFrame的列

经常需要用DataFrame的列作为索引，或者希望将索引变成DataFrame的列。

In [210]: df = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one']*7,'d':[0,1,2,0,1,2,3]})

In [211]: df
Out[211]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  one  0
4  4  3  one  1
5  5  2  one  2
6  6  1  one  3

In [212]: df2 = df.set_index(['c','d']) #默认情况下，会将转换的这两列删除掉；

In [213]: df2
Out[213]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

    In [215]: df2 = df.set_index(['c','d'],drop=False) # 仍然保留这两列

    In [216]: df2
    Out[216]:
           a  b    c  d
    c   d              
    one 0  0  7  one  0
        1  1  6  one  1
        2  2  5  one  2
        0  3  4  one  0
        1  4  3  one  1
        2  5  2  one  2
        3  6  1  one  3

用reset_index可以将索引合并到DataFrame中。

In [217]: df2 = df.set_index(['c','d'])

In [218]: df2
Out[218]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

In [219]: df2.reset_index()
Out[219]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  one  0  3  4
4  one  1  4  3
5  one  2  5  2
6  one  3  6  1

6 其他

6.1 整数索引

先看一个例子：我们很难判断是要通过位置还是通过标签的索引来获取数据。

In [220]: ser = Series(np.arange(3))

In [221]: ser
Out[221]:
0    0
1    1
2    2
dtype: int64

In [222]: ser[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...

这样对于使用字母索引的Series就不存在这个问题。

如果需要可靠的、不考虑索引类型的、基于位置的索引，可以使用：

Series：iget_value
DataFrame：irow和icol

新的版本有些变化：都是用iloc来通过位置准确获取。

In [231]: ser3 = Series(np.arange(3),index=[-5,1,3])

In [232]: ser3.iget_value(2)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[232]: 2

In [236]: ser3.iloc[2]
Out[236]: 2

In [237]: ser3.iat[2]
Out[237]: 2

In [239]: frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])

In [241]: frame
Out[241]:
   0  1
2  0  1
0  2  3
1  4  5

In [242]: frame.irow(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[242]:
0    2
1    3
Name: 0, dtype: int64

In [243]: frame.icol(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[243]:
2    1
0    3
1    5
Name: 1, dtype: int64

In [245]: frame.iloc[1]  # 按行位置获取
Out[245]:
0    2
1    3
Name: 0, dtype: int64

In [246]: frame.iloc[:,1]  #按列位置获取
Out[246]:
2    1
0    3
1    5
Name: 1, dtype: int64

6.2 面板数据

Panel数据结构，可以看成是一个三维的DataFrame数据结构。
Panel中的每一项都是一个DataFrame。
同样使用堆积式（层次化索引的）的DataFrame可以表示一个panel。

In [247]: import pandas.io.data as web
/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py:35: FutureWarning:
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)

In [248]: web
Out[248]: <module 'pandas.io.data' from '/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py'>

In [249]: pdata = pd.Panel(dict((stk ,web.get_data_yahoo(stk,'1/1/2009','6/1/2012')) for stk in ['AAPL','GOOG','MSFT','DELL']))

In [250]: pdata
Out[250]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close

In [252]: pdata = pdata.swapaxes('items','minor')

In [253]: pdata['Adj Close']
Out[253]:
                 AAPL      DELL        GOOG       MSFT
Date                                                  
2009-01-02  11.808505  10.39902  160.499779  16.501303
...
2012-05-30  75.362333  12.14992  293.821674  25.878448
2012-05-31  75.174961  11.92743  290.140354  25.746145
2012-06-01  72.996726  11.67592  285.205295  25.093451

[868 rows x 4 columns]

In [256]: pdata.ix[:,'6/1/2012',:]  # ix扩展为三维
Out[256]:
            Open        High         Low       Close       Volume   Adj Close
AAPL  569.159996  572.650009  560.520012  560.989983  130246900.0   72.996726
DELL   12.150000   12.300000   12.045000   12.070000   19397600.0   11.675920
GOOG  571.790972  572.650996  568.350996  570.981000    6138700.0  285.205295
MSFT   28.760000   28.959999   28.440001   28.450001   56634300.0   25.093451

In [260]: pdata.ix[:,'5/30/2012':,:].to_frame()
Out[260]:
                        Open        High         Low       Close       Volume  
Date       minor                                                                
2012-05-30 AAPL   569.199997  579.989990  566.559990  579.169998  132357400.0   
           DELL    12.590000   12.700000   12.460000   12.560000   19787800.0   
           GOOG   588.161028  591.901014  583.530999  588.230992    3827600.0   
           MSFT    29.350000   29.480000   29.120001   29.340000   41585500.0   
2012-05-31 AAPL   580.740021  581.499985  571.460022  577.730019  122918600.0   
           DELL    12.530000   12.540000   12.330000   12.330000   19955600.0   
           GOOG   588.720982  590.001032  579.001013  580.860990    5958800.0   
           MSFT    29.299999   29.420000   28.940001   29.190001   39134000.0   
2012-06-01 AAPL   569.159996  572.650009  560.520012  560.989983  130246900.0   
           DELL    12.150000   12.300000   12.045000   12.070000   19397600.0   
           GOOG   571.790972  572.650996  568.350996  570.981000    6138700.0   
           MSFT    28.760000   28.959999   28.440001   28.450001   56634300.0   

                   Adj Close  
Date       minor              
2012-05-30 AAPL    75.362333  
           DELL    12.149920  
           GOOG   293.821674  
           MSFT    25.878448  
2012-05-31 AAPL    75.174961  
           DELL    11.927430  
           GOOG   290.140354  
           MSFT    25.746145  
2012-06-01 AAPL    72.996726  
           DELL    11.675920  
           GOOG   285.205295  
           MSFT    25.093451  

# 可以转化为DataFrame
In [261]: stacked = pdata.ix[:,'5/30/2012':,:].to_frame()

In [262]: stacked.to_panel() # 转化为panel
Out[262]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT