pandas 基础操作更新

创建一个Series,同时让pandas自动生成索引列
创建一个DataFrame数据框
查看数据
数据的简单统计
数据的排序
选择数据（类似于数据库中sql语句）
数据合并
分组操作Groupby操作
reshape操作
- stack 与unstack 方法
pivot_table 透视表
至此，pandas的基础的使用介绍也就结束了，后续会有专题性质的分析，包括(字符串处理，apply的使用，数据合并，透视表，时间序列的分析）

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

创建一个Series,同时让pandas自动生成索引列

s = pd.Series([1,3,5,np.nan,6,8])

# 查看s
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

创建一个DataFrame数据框

### 创建一个DataFrame ，可以传入一个numpy array 可以自己构建索引以及列标
dates = pd.date_range('2018-11-01',periods=7)
#### 比如说生成一个时间序列，以20181101 为起始位置的，7个日期组成的时间序列，数据的类型为datetime64[ns]

dates

DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04',
               '2018-11-05', '2018-11-06', '2018-11-07'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(7,4),index= dates,columns=list('ABCD'))
df
# 产生随机正态分布的数据，7行4列，分别对应的index的长度以及column的长度

	A	B	C	D
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-03	-0.518426	0.730866	-1.033830	0.712624
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208
2018-11-06	-0.030580	0.545561	1.091127	-0.131579
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027

### 同时用可以使用dict的实行创建DataFrame
df2 = pd.DataFrame({"A":1,
                   "B":"20181101",
                   'C':np.array([3]*4,dtype='int32'),
                   'D':pd.Categorical(['test','train','test','train']),
                   "E":1.5},
                  )
df2

	A	B	C	D	E
0	1	20181101	3	test	1.5
1	1	20181101	3	train	1.5
2	1	20181101	3	test	1.5
3	1	20181101	3	train	1.5

df2.dtypes
### 查看数据框中的数据类型,常见的数据类型还有时间类型以及float类型

A       int64
B      object
C       int32
D    category
E     float64
dtype: object

查看数据


# 比如说看前5行
df.head()

	A	B	C	D
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-03	-0.518426	0.730866	-1.033830	0.712624
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208

# 后4行
df.tail(4)

	A	B	C	D
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208
2018-11-06	-0.030580	0.545561	1.091127	-0.131579
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027

# 查看DataFrame的索引
df.index

DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04',
               '2018-11-05', '2018-11-06', '2018-11-07'],
              dtype='datetime64[ns]', freq='D')

# 查看DataFrame的列索引
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

# 查看DataFrame的数据,将DataFrame转化为numpy array 的数据形式
df.values

array([[-0.1703643 , -0.23754121,  0.52990284,  0.66007285],
       [-0.15844565, -0.48853537,  0.08296043, -1.91357255],
       [-0.51842554,  0.73086567, -1.03382969,  0.71262388],
       [ 1.01352712,  0.27016714,  0.08180539,  0.17819344],
       [-0.89749689, -0.01627937, -0.23499323,  0.08120819],
       [-0.03058032,  0.54556063,  1.09112723, -0.13157934],
       [-0.31334198, -0.68817881, -0.41775393,  0.85502652]])

数据的简单统计

# 可以使用describe函数对DataFrame中的数值型数据进行统计
df.describe()

	A	B	C	D
count	7.000000	7.000000	7.000000	7.000000
mean	-0.153590	0.016580	0.014174	0.063139
std	0.590144	0.527860	0.680939	0.945526
min	-0.897497	-0.688179	-1.033830	-1.913573
25%	-0.415884	-0.363038	-0.326374	-0.025186
50%	-0.170364	-0.016279	0.081805	0.178193
75%	-0.094513	0.407864	0.306432	0.686348
max	1.013527	0.730866	1.091127	0.855027

df2.describe()
### 对于其他的数据类型的数据describe函数会自动过滤掉

	A	C	E
count	4.0	4.0	4.0
mean	1.0	3.0	1.5
std	0.0	0.0	0.0
min	1.0	3.0	1.5
25%	1.0	3.0	1.5
50%	1.0	3.0	1.5
75%	1.0	3.0	1.5
max	1.0	3.0	1.5

### DataFrame 的转置，将列索引与行索引进行调换，行数据与列数进行调换
df.T

	2018-11-01 00:00:00	2018-11-02 00:00:00	2018-11-03 00:00:00	2018-11-04 00:00:00	2018-11-05 00:00:00	2018-11-06 00:00:00	2018-11-07 00:00:00
A	-0.170364	-0.158446	-0.518426	1.013527	-0.897497	-0.030580	-0.313342
B	-0.237541	-0.488535	0.730866	0.270167	-0.016279	0.545561	-0.688179
C	0.529903	0.082960	-1.033830	0.081805	-0.234993	1.091127	-0.417754
D	0.660073	-1.913573	0.712624	0.178193	0.081208	-0.131579	0.855027

df

	A	B	C	D
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-03	-0.518426	0.730866	-1.033830	0.712624
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208
2018-11-06	-0.030580	0.545561	1.091127	-0.131579
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027

数据的排序

df.sort_index(ascending=False)
### 降序，按照列进行降序，通过该索引列

	A	B	C	D
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027
2018-11-06	-0.030580	0.545561	1.091127	-0.131579
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-03	-0.518426	0.730866	-1.033830	0.712624
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-01	-0.170364	-0.237541	0.529903	0.660073


print(df.sort_values(by=['B','A']))
#  默认是升序,可以选择多指排序，先照B，后排A，如果B中的数据一样，则按照A中的大小进行排序
df.sort_values(by='B')

                   A         B         C         D
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-03 -0.518426  0.730866 -1.033830  0.712624

	A	B	C	D
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-06	-0.030580	0.545561	1.091127	-0.131579
2018-11-03	-0.518426	0.730866	-1.033830	0.712624

选择数据（类似于数据库中sql语句）

df['A']
# 取出单独的一列数据，等价于df.A

2018-11-01   -0.170364
2018-11-02   -0.158446
2018-11-03   -0.518426
2018-11-04    1.013527
2018-11-05   -0.897497
2018-11-06   -0.030580
2018-11-07   -0.313342
Freq: D, Name: A, dtype: float64

# 通过[]进行行选择切片
df[0:3]

	A	B	C	D
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-03	-0.518426	0.730866	-1.033830	0.712624

# 同时对于时间索引而言，可以直接使用比如
df['2018-11-01':'2018-11-04']

	A	B	C	D
2018-11-01	-0.170364	-0.237541	0.529903	0.660073
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573
2018-11-03	-0.518426	0.730866	-1.033830	0.712624
2018-11-04	1.013527	0.270167	0.081805	0.178193

另外可以使用标签来选择


df.loc['2018-11-01']

A   -0.170364
B   -0.237541
C    0.529903
D    0.660073
Name: 2018-11-01 00:00:00, dtype: float64

#### 通过标签来进行多个轴上的进行选择
df.loc[:,["A","B"]] # 等价于df[["A","B"]]

	A	B
2018-11-01	-0.170364	-0.237541
2018-11-02	-0.158446	-0.488535
2018-11-03	-0.518426	0.730866
2018-11-04	1.013527	0.270167
2018-11-05	-0.897497	-0.016279
2018-11-06	-0.030580	0.545561
2018-11-07	-0.313342	-0.688179

df.loc["2018-11-01":"2018-11-03",["A","B"]]

	A	B
2018-11-01	-0.170364	-0.237541
2018-11-02	-0.158446	-0.488535
2018-11-03	-0.518426	0.730866

#### 获得一个标量数据
df.loc['2018-11-01','A']

-0.17036430076617162

通过位置获取数据

df.iloc[3]  # 获得第四行的数据

A    1.013527
B    0.270167
C    0.081805
D    0.178193
Name: 2018-11-04 00:00:00, dtype: float64

df.iloc[1:3,1:4]  #  与numpy中的ndarray类似

	B	C	D
2018-11-02	-0.488535	0.08296	-1.913573
2018-11-03	0.730866	-1.03383	0.712624

# 可以选取不连续的行或者列进行取值
df.iloc[[1,3],[1,3]]

	B	D
2018-11-02	-0.488535	-1.913573
2018-11-04	0.270167	0.178193

#  对行进行切片处理
df.iloc[1:3,:]

	A	B	C	D
2018-11-02	-0.158446	-0.488535	0.08296	-1.913573
2018-11-03	-0.518426	0.730866	-1.03383	0.712624

# 对列进行切片
df.iloc[:,1:4]

	B	C	D
2018-11-01	-0.237541	0.529903	0.660073
2018-11-02	-0.488535	0.082960	-1.913573
2018-11-03	0.730866	-1.033830	0.712624
2018-11-04	0.270167	0.081805	0.178193
2018-11-05	-0.016279	-0.234993	0.081208
2018-11-06	0.545561	1.091127	-0.131579
2018-11-07	-0.688179	-0.417754	0.855027

# 获取特定的值
df.iloc[1,3]

-1.9135725473596013

布尔值索引

# 使用单列的数据作为条件进行筛选
df[df.A>0]

	A	B	C	D
2018-11-04	1.013527	0.270167	0.081805	0.178193

 #很少用到，很少使用这种大范围的条件进行筛选
df[df>0]

	A	B	C	D
2018-11-01	NaN	NaN	0.529903	0.660073
2018-11-02	NaN	NaN	0.082960	NaN
2018-11-03	NaN	0.730866	NaN	0.712624
2018-11-04	1.013527	0.270167	0.081805	0.178193
2018-11-05	NaN	NaN	NaN	0.081208
2018-11-06	NaN	0.545561	1.091127	NaN
2018-11-07	NaN	NaN	NaN	0.855027

# 使用isin()方法过滤
df2.head()

	A	B	C	D	E
0	1	20181101	3	test	1.5
1	1	20181101	3	train	1.5
2	1	20181101	3	test	1.5
3	1	20181101	3	train	1.5

df2[df2['D'].isin(['test'])]

	A	B	C	D	E
0	1	20181101	3	test	1.5
2	1	20181101	3	test	1.5

设定数值（类似于sql update 或者add）

设定一个新的列

df['E'] = [1,2,3,4,5,6,7]

df

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	0.660073	1
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573	2
2018-11-03	-0.518426	0.730866	-1.033830	0.712624	3
2018-11-04	1.013527	0.270167	0.081805	0.178193	4
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	5
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	6
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	7

通过标签设定新的值

df.loc['2018-11-01','E']= 10  # 第一行，E列的数据修改为10

df

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	0.660073	10
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573	2
2018-11-03	-0.518426	0.730866	-1.033830	0.712624	3
2018-11-04	1.013527	0.270167	0.081805	0.178193	4
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	5
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	6
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	7

df.iloc[1,4]=5000  # 第二行第五列数据修改为5000

df

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	0.660073	10
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573	5000
2018-11-03	-0.518426	0.730866	-1.033830	0.712624	3
2018-11-04	1.013527	0.270167	0.081805	0.178193	4
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	5
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	6
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	7

df3 =df.copy()
df3[df3<0]= -df3
df3  # 都变成非负数

	A	B	C	D	E
2018-11-01	0.170364	0.237541	0.529903	0.660073	10
2018-11-02	0.158446	0.488535	0.082960	1.913573	5000
2018-11-03	0.518426	0.730866	1.033830	0.712624	3
2018-11-04	1.013527	0.270167	0.081805	0.178193	4
2018-11-05	0.897497	0.016279	0.234993	0.081208	5
2018-11-06	0.030580	0.545561	1.091127	0.131579	6
2018-11-07	0.313342	0.688179	0.417754	0.855027	7

缺失值处理

df

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	0.660073	10
2018-11-02	-0.158446	-0.488535	0.082960	-1.913573	5000
2018-11-03	-0.518426	0.730866	-1.033830	0.712624	3
2018-11-04	1.013527	0.270167	0.081805	0.178193	4
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	5
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	6
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	7

df['E']=[1,np.nan,2,np.nan,4,np.nan,6]

df.loc['2018-11-01':'2018-11-03','D']=np.nan

df

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	NaN	1.0
2018-11-02	-0.158446	-0.488535	0.082960	NaN	NaN
2018-11-03	-0.518426	0.730866	-1.033830	NaN	2.0
2018-11-04	1.013527	0.270167	0.081805	0.178193	NaN
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	4.0
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	NaN
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	6.0

去掉缺失值的行

df4 = df.copy()

df4.dropna(how='any')

	A	B	C	D	E
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	4.0
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	6.0

df4.dropna(how='all')
# """DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)""" 
# aixs 轴0或者1 index或者columns
# how 方式
# thresh 超过阈值个数的缺失值
# subset 那些字段的处理
# inplace 是否直接在原数据框中的替换

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	NaN	1.0
2018-11-02	-0.158446	-0.488535	0.082960	NaN	NaN
2018-11-03	-0.518426	0.730866	-1.033830	NaN	2.0
2018-11-04	1.013527	0.270167	0.081805	0.178193	NaN
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	4.0
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	NaN
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	6.0

对缺失值就行填充

df4.fillna(1000)

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	1000.000000	1.0
2018-11-02	-0.158446	-0.488535	0.082960	1000.000000	1000.0
2018-11-03	-0.518426	0.730866	-1.033830	1000.000000	2.0
2018-11-04	1.013527	0.270167	0.081805	0.178193	1000.0
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	4.0
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	1000.0
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	6.0

对数据进行布尔值进行填充

pd.isnull(df4)

	A	B	C	D	E
2018-11-01	False	False	False	True	False
2018-11-02	False	False	False	True	True
2018-11-03	False	False	False	True	False
2018-11-04	False	False	False	False	True
2018-11-05	False	False	False	False	False
2018-11-06	False	False	False	False	True
2018-11-07	False	False	False	False	False

数据操作

#统计的工作一般情况下都不包含缺失值，
df4.mean() 
#  默认是对列进行求平均，沿着行方向也就是axis=0

A   -0.153590
B    0.016580
C    0.014174
D    0.245712
E    3.250000
dtype: float64

df4.mean(axis=1)
#  沿着列方向求每行的平均

2018-11-01    0.280499
2018-11-02   -0.188007
2018-11-03    0.294653
2018-11-04    0.385923
2018-11-05    0.586488
2018-11-06    0.368632
2018-11-07    1.087150
Freq: D, dtype: float64

 # 对于拥有不同维度，需要对齐的对象进行操作。Pandas会自动的沿着指定的维度进行广播：
s = pd.Series([1,3,4,np.nan,6,7,8],index=dates)
s

2018-11-01    1.0
2018-11-02    3.0
2018-11-03    4.0
2018-11-04    NaN
2018-11-05    6.0
2018-11-06    7.0
2018-11-07    8.0
Freq: D, dtype: float64

df4.sub(s,axis='index')

	A	B	C	D	E
2018-11-01	-1.170364	-1.237541	-0.470097	NaN	0.0
2018-11-02	-3.158446	-3.488535	-2.917040	NaN	NaN
2018-11-03	-4.518426	-3.269134	-5.033830	NaN	-2.0
2018-11-04	NaN	NaN	NaN	NaN	NaN
2018-11-05	-6.897497	-6.016279	-6.234993	-5.918792	-2.0
2018-11-06	-7.030580	-6.454439	-5.908873	-7.131579	NaN
2018-11-07	-8.313342	-8.688179	-8.417754	-7.144973	-2.0

df4

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	NaN	1.0
2018-11-02	-0.158446	-0.488535	0.082960	NaN	NaN
2018-11-03	-0.518426	0.730866	-1.033830	NaN	2.0
2018-11-04	1.013527	0.270167	0.081805	0.178193	NaN
2018-11-05	-0.897497	-0.016279	-0.234993	0.081208	4.0
2018-11-06	-0.030580	0.545561	1.091127	-0.131579	NaN
2018-11-07	-0.313342	-0.688179	-0.417754	0.855027	6.0

df4.apply(np.cumsum)

	A	B	C	D	E
2018-11-01	-0.170364	-0.237541	0.529903	NaN	1.0
2018-11-02	-0.328810	-0.726077	0.612863	NaN	NaN
2018-11-03	-0.847235	0.004789	-0.420966	NaN	3.0
2018-11-04	0.166292	0.274956	-0.339161	0.178193	NaN
2018-11-05	-0.731205	0.258677	-0.574154	0.259402	7.0
2018-11-06	-0.761786	0.804237	0.516973	0.127822	NaN
2018-11-07	-1.075128	0.116059	0.099219	0.982849	13.0

df4.apply(lambda x: x.max()-x.min())

A    1.911024
B    1.419044
C    2.124957
D    0.986606
E    5.000000
dtype: float64

统计个数与离散化

s = pd.Series(np.random.randint(0,7,size=15))
s

0     5
1     4
2     1
3     2
4     1
5     0
6     2
7     6
8     4
9     3
10    1
11    1
12    1
13    3
14    2
dtype: int32

s.value_counts()
# 统计元素的个数，并按照元素统计量进行排序，未出现的元素不会显示出来

1    5
2    3
4    2
3    2
6    1
5    1
0    1
dtype: int64

s.reindex(range(0,7))
# 按照固定的顺序输出元素的个数统计

0    5
1    4
2    1
3    2
4    1
5    0
6    2
dtype: int32

s.mode()
#  众数

0    1
dtype: int32

离散化

# 连续值转化为离散值，可以使用cut函数进行操作（bins based on vlaues） qcut （bins based on sample
# quantiles） 函数
arr = np.random.randint(0,20,size=15)  # 正态分布
arr

array([ 5, 18, 13, 16, 16,  1, 15, 11,  0, 17, 16, 18, 15, 12, 13])

factor = pd.cut(arr,3)
factor

[(-0.018, 6.0], (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], ..., (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], (6.0, 12.0], (12.0, 18.0]]
Length: 15
Categories (3, interval[float64]): [(-0.018, 6.0] < (6.0, 12.0] < (12.0, 18.0]]

pd.value_counts(factor)

(12.0, 18.0]     10
(-0.018, 6.0]     3
(6.0, 12.0]       2
dtype: int64

factor1 = pd.cut(arr,[-1,5,10,15,20])

pd.value_counts(factor1)

(15, 20]    6
(10, 15]    6
(-1, 5]     3
(5, 10]     0
dtype: int64

factor2 = pd.qcut(arr,[0,0.25,0.5,0.75,1])

pd.value_counts(factor2)

(11.5, 15.0]      5
(-0.001, 11.5]    4
(16.0, 18.0]      3
(15.0, 16.0]      3
dtype: int64

pandas 处理字符串（单独一个大的章节，这人不做详述）

数据合并

concat
merge（类似于sql数据库中的join）
append

首先看concat合并数据框

df = pd.DataFrame(np.random.randn(10,4))  #  10行列的标准正态分布数据框
df

	0	1	2	3
0	0.949746	-0.050767	1.478622	-0.239901
1	-0.297120	-0.562589	0.371837	1.180715
2	0.953856	0.492295	0.821156	-0.323328
3	0.016153	1.554225	-1.166304	-0.904040
4	0.204763	-0.951291	-1.317620	0.672900
5	2.241006	-0.925746	-1.961408	0.853367
6	2.217133	-0.430812	0.518926	1.741445
7	-0.571104	-0.437305	-0.902241	0.786231
8	-2.511387	0.523760	1.811622	-0.777296
9	0.252690	0.901952	0.619614	-0.006631

d1,d2,d3  = df[:3],df[3:7],df[7:]
d1,d2,d3

(          0         1         2         3
 0  0.949746 -0.050767  1.478622 -0.239901
 1 -0.297120 -0.562589  0.371837  1.180715
 2  0.953856  0.492295  0.821156 -0.323328,
           0         1         2         3
 3  0.016153  1.554225 -1.166304 -0.904040
 4  0.204763 -0.951291 -1.317620  0.672900
 5  2.241006 -0.925746 -1.961408  0.853367
 6  2.217133 -0.430812  0.518926  1.741445,
           0         1         2         3
 7 -0.571104 -0.437305 -0.902241  0.786231
 8 -2.511387  0.523760  1.811622 -0.777296
 9  0.252690  0.901952  0.619614 -0.006631)

pd.concat([d1,d2,d3])
#合并三个数据框，数据结构相同，通常合并相同结构的数据，数据框中的字段一致，类似于数据添加新的数据来源

	0	1	2	3
0	0.949746	-0.050767	1.478622	-0.239901
1	-0.297120	-0.562589	0.371837	1.180715
2	0.953856	0.492295	0.821156	-0.323328
3	0.016153	1.554225	-1.166304	-0.904040
4	0.204763	-0.951291	-1.317620	0.672900
5	2.241006	-0.925746	-1.961408	0.853367
6	2.217133	-0.430812	0.518926	1.741445
7	-0.571104	-0.437305	-0.902241	0.786231
8	-2.511387	0.523760	1.811622	-0.777296
9	0.252690	0.901952	0.619614	-0.006631

merge方式合并（数据库中的join）

left = pd.DataFrame({'key':['foo','foo'],"lval":[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})

left

	key	lval
0	foo	1
1	foo	2

right

	key	rval
0	foo	4
1	foo	5

pd.merge(left,right,on='key')

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5

left = pd.DataFrame({'key':['foo','bar'],"lval":[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'rval':[4,5]})
pd.merge(left,right,on='key')

	key	lval	rval
0	foo	1	4
1	bar	2	5

left

	key	lval
0	foo	1
1	bar	2

right

	key	rval
0	foo	4
1	bar	5

Append方式合并数据

#  与concat 类似，常用的方法可以参考一下日子
df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df

	A	B	C	D
0	1.825997	-0.331086	-0.067143	0.747226
1	-0.027497	0.861639	0.928621	-2.549617
2	-0.546645	-0.072253	-0.788483	0.484140
3	-0.472240	-1.776993	-1.647407	0.170596
4	-0.099453	0.380143	-0.890510	1.233741
5	0.351915	0.137522	-1.165938	1.128146
6	0.558442	-1.047060	-0.598197	-1.979876
7	0.067321	-1.037666	-1.140675	-0.098562

## 
d1 = df.iloc[3]
df.append(d1,ignore_index= True)

	A	B	C	D
0	1.825997	-0.331086	-0.067143	0.747226
1	-0.027497	0.861639	0.928621	-2.549617
2	-0.546645	-0.072253	-0.788483	0.484140
3	-0.472240	-1.776993	-1.647407	0.170596
4	-0.099453	0.380143	-0.890510	1.233741
5	0.351915	0.137522	-1.165938	1.128146
6	0.558442	-1.047060	-0.598197	-1.979876
7	0.067321	-1.037666	-1.140675	-0.098562
8	-0.472240	-1.776993	-1.647407	0.170596

分组操作Groupby操作

df = pd.DataFrame({"A":['foo','bar','foo','bar'],
                  "B":['one','one','two','three'],
                  "C":np.random.randn(4),
                  "D":np.random.randn(4)})
df

	A	B	C	D
0	foo	one	0.938910	0.505163
1	bar	one	0.660543	0.353860
2	foo	two	0.520309	1.157462
3	bar	three	-1.054927	0.290693

df.groupby('A').sum()

	C	D
A
bar	-0.394384	0.644553
foo	1.459219	1.662625

df.groupby('A').size()

A
bar    2
foo    2
dtype: int64

df.groupby(['A',"B"]).sum()

		C	D
A	B
bar	one	0.660543	0.353860
bar	three	-1.054927	0.290693
foo	one	0.938910	0.505163
foo	two	0.520309	1.157462

df.groupby(['A',"B"]).size()

A    B    
bar  one      1
     three    1
foo  one      1
     two      1
dtype: int64

reshape操作

tuples = list(zip(*[['bar','bar','baz','baz','foo','foo','qux','qux'],
                   ['one','two','one','two','one','two','one','two']]))

index = pd.MultiIndex.from_tuples(tuples,names=['first','second'])
df = pd.DataFrame(np.random.randn(8,2),index= index,columns=['A','B'])
df2 =  df[:4]

df2

		A	B
first	second
bar	one	0.510758	0.641370
bar	two	0.481230	-0.470894
baz	one	-0.076294	0.121247
baz	two	0.378507	-1.358932

df

		A	B
first	second
bar	one	0.510758	0.641370
bar	two	0.481230	-0.470894
baz	one	-0.076294	0.121247
baz	two	0.378507	-1.358932
foo	one	-0.873012	0.531595
foo	two	0.266968	-0.393124
qux	one	0.981866	1.205994
qux	two	0.265772	0.132489

stack 与unstack 方法

df2_stacked = df2.stack()  
#  将column也作为index

df2_stacked

first  second   
bar    one     A    0.510758
               B    0.641370
       two     A    0.481230
               B   -0.470894
baz    one     A   -0.076294
               B    0.121247
       two     A    0.378507
               B   -1.358932
dtype: float64

df2_stacked.unstack()  #  回复到原来的状态

		A	B
first	second
bar	one	0.510758	0.641370
bar	two	0.481230	-0.470894
baz	one	-0.076294	0.121247
baz	two	0.378507	-1.358932

df2_stacked

first  second   
bar    one     A    0.510758
               B    0.641370
       two     A    0.481230
               B   -0.470894
baz    one     A   -0.076294
               B    0.121247
       two     A    0.378507
               B   -1.358932
dtype: float64

df2_stacked.unstack(1)

	second	one	two
first
bar	A	0.510758	0.481230
bar	B	0.641370	-0.470894
baz	A	-0.076294	0.378507
baz	B	0.121247	-1.358932

df2_stacked.unstack(0)

	first	bar	baz
second
one	A	0.510758	-0.076294
one	B	0.641370	0.121247
two	A	0.481230	0.378507
two	B	-0.470894	-1.358932

pivot_table 透视表

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,                    'B' : ['A', 'B', 'C'] * 4,
                 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                  'D' : np.random.randn(12),
                 'E' : np.random.randn(12)})

df

	A	B	C	D	E
0	one	A	foo	0.006247	-0.894827
1	one	B	foo	1.653974	-0.340107
2	two	C	foo	-1.627485	-1.011403
3	three	A	bar	-0.716002	1.533422
4	one	B	bar	0.422688	-0.807675
5	one	C	bar	0.264818	0.249770
6	two	A	foo	0.643288	-1.166616
7	three	B	foo	0.348041	-0.659099
8	one	C	foo	1.593486	-1.098731
9	one	A	bar	-0.389344	0.919528
10	two	B	bar	-1.407450	1.269716
11	three	C	bar	-0.172672	0.883970

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean)

	C	bar	foo
A	B
one	A	-0.389344	0.006247
	B	0.422688	1.653974
	C	0.264818	1.593486
three	A	-0.716002	NaN
	B	NaN	0.348041
	C	-0.172672	NaN
two	A	NaN	0.643288
	B	-1.407450	NaN
	C	NaN	-1.627485

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.sum)

	C	bar	foo
A	B
one	A	-0.389344	0.006247
	B	0.422688	1.653974
	C	0.264818	1.593486
three	A	-0.716002	NaN
	B	NaN	0.348041
	C	-0.172672	NaN
two	A	NaN	0.643288
	B	-1.407450	NaN
	C	NaN	-1.627485

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean,fill_value=0)

	C	bar	foo
A	B
one	A	-0.389344	0.006247
	B	0.422688	1.653974
	C	0.264818	1.593486
three	A	-0.716002	0.000000
	B	0.000000	0.348041
	C	-0.172672	0.000000
two	A	0.000000	0.643288
	B	-1.407450	0.000000
	C	0.000000	-1.627485

df1 = pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean,fill_value=0)

df1.index

MultiIndex(levels=[['one', 'three', 'two'], ['A', 'B', 'C']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['A', 'B'])

df1.stack()

A      B  C  
one    A  bar   -0.389344
          foo    0.006247
       B  bar    0.422688
          foo    1.653974
       C  bar    0.264818
          foo    1.593486
three  A  bar   -0.716002
          foo    0.000000
       B  bar    0.000000
          foo    0.348041
       C  bar   -0.172672
          foo    0.000000
two    A  bar    0.000000
          foo    0.643288
       B  bar   -1.407450
          foo    0.000000
       C  bar    0.000000
          foo   -1.627485
dtype: float64

df1.unstack()

C	bar			foo
B	A	B	C	A	B	C
A
one	-0.389344	0.422688	0.264818	0.006247	1.653974	1.593486
three	-0.716002	0.000000	-0.172672	0.000000	0.348041	0.000000
two	0.000000	-1.407450	0.000000	0.643288	0.000000	-1.627485

df1.unstack(1)

C	bar			foo
B	A	B	C	A	B	C
A
one	-0.389344	0.422688	0.264818	0.006247	1.653974	1.593486
three	-0.716002	0.000000	-0.172672	0.000000	0.348041	0.000000
two	0.000000	-1.407450	0.000000	0.643288	0.000000	-1.627485

df1.unstack(0)

C	bar			foo
A	one	three	two	one	three	two
B
A	-0.389344	-0.716002	0.00000	0.006247	0.000000	0.643288
B	0.422688	0.000000	-1.40745	1.653974	0.348041	0.000000
C	0.264818	-0.172672	0.00000	1.593486	0.000000	-1.627485

pandas 基础操作 更新

创建一个Series,同时让pandas自动生成索引列

创建一个DataFrame数据框

查看数据

数据的简单统计

数据的排序

选择数据（类似于数据库中sql语句）

另外可以使用标签来选择

通过位置获取数据

布尔值索引

设定数值（类似于sql update 或者add）

缺失值处理

数据操作

统计个数与离散化

pandas 处理字符串（单独一个大的章节，这人不做详述）

数据合并

首先看concat合并数据框

merge方式合并（数据库中的join）

Append方式合并数据

分组操作Groupby操作

reshape操作

stack 与unstack 方法

pivot_table 透视表

至此，pandas的基础的使用介绍也就结束了，后续会有专题性质的分析，包括(字符串处理，apply的使用，数据合并，透视表，时间序列的分析）

pandas 基础操作更新