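Python pandas module

Introduction

　　pandas is a powerful data analysis toolkit for Python.
　　pandas is built on top of NumPy.
　　Main features of pandas:
　　　　Data structures with built-in index alignment: DataFrame and Series
　　　　Integrated time series functionality
　　　　A rich set of mathematical operations
　　　　Flexible handling of missing data
　　Installation: pip install pandas
　　Import: import pandas as pd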

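pandas: Series

　　A Series is a one-dimensional array-like object made up of a sequence of values and an associated sequence of data labels (the index).
　　A Series behaves like a cross between a list (array) and a dictionary.
　　Ways to create one:

　　　　pd.Series([4,7,-5,3])
　　　　pd.Series([4,7,-5,3],index=['a','b','c','d'])
　　　　pd.Series({'a':1, 'b':2})
　　　　pd.Series(0, index=['a','b','c','d'])
　　Getting the value array and the index: the values and index attributes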
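
As a minimal sketch of the four constructors listed above (the variable names are illustrative, not from the original notes), the following builds each Series and reads back its values and index:

```python
import pandas as pd

# Four ways to build a Series, mirroring the constructors listed above.
s_default = pd.Series([4, 7, -5, 3])                         # default integer index
s_labeled = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
s_from_dict = pd.Series({'a': 1, 'b': 2})                    # dict keys become labels
s_scalar = pd.Series(0, index=['a', 'b', 'c', 'd'])          # the scalar is repeated per label

print(s_labeled.values)        # the underlying array of values
print(list(s_labeled.index))   # the labels
```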

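pandas: Series features
　　A Series supports NumPy array features (positional access):
　　　　Create a Series from an ndarray: pd.Series(arr)
　　　　Arithmetic with a scalar: sr*2
　　　　Arithmetic between two Series: sr1+sr2 # values are matched by index label
　　　　Indexing: sr[0], sr[[1,2,4]] (when the index is not made of integers, prefer sr.iloc[0] and sr.iloc[[1,2,4]]; recent pandas versions deprecate positional access through plain [])
　　　　Slicing: sr[0:2] (a slice is still a view, not a copy)
　　　　Universal functions: np.abs(sr)
　　　　Boolean filtering: sr[sr>0]
　　　　Statistical methods: mean() sum() cumsum()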
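
The array-like operations above can be sketched as follows. The sample values are made up for illustration, and positional access is written with iloc so that it does not depend on the index type:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.array([1.0, -2.0, 3.0, -4.0, 5.0]), index=list('abcde'))

doubled = sr * 2                 # arithmetic with a scalar is element-wise
first = sr.iloc[0]               # single element by position
picked = sr.iloc[[1, 2, 4]]      # several elements by position
head = sr.iloc[0:2]              # positional slice, end position excluded
magnitudes = np.abs(sr)          # a NumPy universal function returns a Series
positives = sr[sr > 0]           # boolean filtering
running = sr.cumsum()            # cumulative sum
```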

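　　A Series supports dictionary features (label access):
　　　　Create a Series from a dictionary: pd.Series(dic)
　　　　The in operator: 'a' in sr tests the index labels; note that for x in sr iterates over the values, not the labels
　　　　Label indexing: sr['a'], sr[['a', 'b', 'd']]
　　　　Label slicing: sr['a':'c'] (the end label is included)
　　　　Other methods: get('a', default=0) and so on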
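
A short sketch of the dictionary-like behavior, using an illustrative four-element Series. It highlights the two points where a Series differs from a dict or a list: iteration yields values, and a label slice includes its end label.

```python
import pandas as pd

sr = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

has_a = 'a' in sr                  # membership is tested against the labels
values = [x for x in sr]           # iteration yields the values, unlike a dict
one = sr['a']                      # single label
subset = sr[['a', 'b', 'd']]       # list of labels
span = sr['a':'c']                 # label slice, end label 'c' included
fallback = sr.get('z', default=0)  # absent label falls back to the default
```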

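Integer indexes
　　pandas objects with an integer index often trip up newcomers.
　　Example:
　　　　sr = pd.Series(np.arange(4.))
　　　　sr[-1] # raises KeyError: -1 is treated as a label, and no such label exists
　　If the index has an integer type, selecting data with integers through [] is always label-oriented.
　　　　The loc attribute interprets keys as labels
　　　　The iloc attribute interprets keys as positions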
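
The pitfall can be reproduced with a small sketch. The second Series, sr2, is an illustrative addition whose integer labels are deliberately out of order, so that label lookup and positional lookup give different answers:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(4.))      # integer labels 0, 1, 2, 3

try:
    sr[-1]                         # looked up as the label -1, which does not exist
    raised = False
except KeyError:
    raised = True

last = sr.iloc[-1]                 # position-based: the last element

sr2 = pd.Series([10, 20, 30], index=[2, 0, 1])
by_label = sr2.loc[0]              # the element whose label is 0
by_position = sr2.iloc[0]          # the first element
```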

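Data alignment
　　When pandas performs arithmetic, it first aligns the operands by index and then computes. If the indexes differ, the index of the result is the union of the two operands' indexes.
　　Example:
　　sr1 = pd.Series([12,23,34], index=['c','a','d'])
　　sr2 = pd.Series([11,20,10], index=['d','c','a'])
　　sr1+sr2
　　sr3 = pd.Series([11,20,10,14], index=['d','c','a','b'])
　　sr1+sr3
　　# Labels present in only one operand produce NaN, the missing-value marker

　　How can missing values be treated as 0 when adding two Series?
　　sr1.add(sr3, fill_value=0)
　　Flexible arithmetic methods: add, sub, div, mul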
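
Using the sr1 and sr3 values from the example above, this sketch shows the difference between the + operator and add with fill_value:

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

total = sr1 + sr3                      # 'b' exists only in sr3, so its result is NaN
filled = sr1.add(sr3, fill_value=0)    # the missing side is treated as 0 instead

print(total)
print(filled)
```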

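Series missing data
　　Missing data: pandas uses NaN (Not a Number) to represent missing data; its value is np.nan. The built-in None is also treated as NaN.
　　Methods for handling missing data:
　　　　dropna() drops the entries whose value is NaN
　　　　fillna() fills in missing data
　　　　isnull() returns a boolean array that is True where values are missing
　　　　notnull() returns a boolean array that is False where values are missing
　　Filtering out missing data: sr.dropna() or sr[sr.notnull()]
　　Filling missing data: sr.fillna(0)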
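
A minimal sketch of these methods on an illustrative Series that mixes np.nan and None:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0, None])   # None is stored as NaN

mask = sr.isnull()            # True where the value is missing
kept = sr.dropna()            # missing entries removed
same = sr[sr.notnull()]       # equivalent filtering with a boolean mask
zeroed = sr.fillna(0)         # missing entries replaced by 0
```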

DataFrame
　　A DataFrame is a tabular data structure containing an ordered collection of columns.
　　A DataFrame can be thought of as a dictionary of Series that all share one index.
　　Ways to create one:
　　　　pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
　　　　pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
　　　　……
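
The two constructors above behave differently when the inputs carry their own index. In this sketch (df1 and df2 are illustrative names), the Series are aligned on their labels, and a label missing from one Series yields NaN in that column:

```python
import pandas as pd

# Plain lists: rows get a default integer index.
df1 = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

# Series with their own labels: rows are aligned by label.
df2 = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})

print(df2)   # 'one' has no entry for label 'd', so that cell is NaN
```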

　　In practice a DataFrame is usually created by importing data.
　　Reading and writing CSV files:
　　　　df = pd.read_csv('filename.csv')
　　　　df.to_csv('filename.csv')
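
Note that read_csv is a function of the pandas module, while to_csv is a DataFrame method. The sketch below avoids touching the disk by reading from an in-memory buffer; the CSV text is made up for illustration.

```python
import io
import pandas as pd

csv_text = "date,close\n2020-01-01,10.5\n2020-01-02,11.0\n"

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Called without a path, to_csv returns the CSV text as a string.
out = df.to_csv(index=False)
print(out)
```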

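Viewing DataFrame data
　　Common attributes and methods for inspecting data:
　　　　index the row index
　　　　T the transpose
　　　　columns the column index
　　　　values the value array
　　　　describe() quick summary statistics

　　The name attribute of each DataFrame column: the column name
　　　　rename(columns={'name': 'newname'})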
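
A brief sketch of these attributes on an illustrative two-column DataFrame. It also shows that rename returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})

rows = list(df.index)          # row labels
cols = list(df.columns)        # column labels
transposed = df.T              # rows and columns swapped
array = df.values              # 2-D NumPy array of the data
summary = df.describe()        # count, mean, std, min, quartiles, max per column
col_name = df['one'].name      # each column is a Series named after the column

renamed = df.rename(columns={'one': 'first'})   # df itself is unchanged
```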

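DataFrame indexing and slicing
　　A DataFrame has a row index and a column index.
　　Selecting by label:
　　　　df['A'] # all data in column A
　　　　df[['A', 'B']] # all data in columns A and B (multiple columns need an extra pair of [])
　　　　df['A'][0] # the row labeled 0 in column A
　　　　df[0:10][['A', 'C']] # rows 0 through 9, columns A and C (any number of columns)

　　The following forms are recommended:
　　　　df.loc[:,['A','B']] # all rows, columns A and B
　　　　df.loc[:,'A':'C'] # all rows, every column from A through C (both ends included)
　　　　df.loc[0,'A'] # row 0, column A
　　　　df.loc[0:10,['A','C']] # rows labeled 0 through 10 inclusive, columns A and C (any number of columns)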
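
The inclusive end of a loc slice is easy to miss, so here is a small sketch on an illustrative 12-row DataFrame that compares it with a plain [] slice:

```python
import pandas as pd

df = pd.DataFrame({'A': range(12), 'B': range(12, 24), 'C': range(24, 36)})

ab = df.loc[:, ['A', 'B']]          # all rows, columns A and B
a_to_c = df.loc[:, 'A':'C']         # label slice: columns A, B and C
cell = df.loc[0, 'A']               # single value
block = df.loc[0:10, ['A', 'C']]    # label slice: rows 0..10, end included
plain = df[0:10]                    # positional slice: rows 0..9, end excluded

print(len(block), len(plain))
```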

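DataFrame indexing and slicing (continued)
　　Selecting by position:
　　　　df.iloc[3] # the row at position 3 (the fourth row, since positions start at 0)
　　　　df.iloc[3,3] # the value at row position 3, column position 3
　　　　df.iloc[0:3,4:6] # rows 0 through 2, columns 4 and 5 (end positions are excluded)
　　　　df.iloc[1:5,:] # rows 1 through 4, all columns
　　　　df.iloc[[1,2,4],[0,3]] # rows 1, 2 and 4, columns 0 and 3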
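
The same positional selections, sketched on an illustrative 6x6 DataFrame whose cell at row r, column c holds 6*r + c, which makes the results easy to check by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6), columns=list('ABCDEF'))

row3 = df.iloc[3]                     # row at position 3 (the fourth row)
cell = df.iloc[3, 3]                  # row position 3, column position 3
block = df.iloc[0:3, 4:6]             # rows 0..2, columns 4..5
rows = df.iloc[1:5, :]                # rows 1..4, all columns
picked = df.iloc[[1, 2, 4], [0, 3]]   # rows 1, 2, 4 and columns 0, 3 only
```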

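　　Filtering with booleans:
　　　　df[df['A']>0] # rows where column A is greater than 0
　　　　df[df['A'].isin([1,3,5])] # rows where column A is one of 1, 3, 5
　　　　df[df<0] = 0 # sets every negative value in the DataFrame to 0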
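
A sketch of the three boolean forms on illustrative data. The assignment is done on a copy so that the original DataFrame stays available for comparison:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, 5], 'B': [-1, 4, -6, 8]})

positive_a = df[df['A'] > 0]              # rows where A is positive
in_set = df[df['A'].isin([1, 3, 5])]      # rows where A is one of the listed values

clipped = df.copy()
clipped[clipped < 0] = 0                  # every negative cell becomes 0
```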

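DataFrame data alignment and missing data
　　DataFrame objects are likewise aligned during arithmetic; the row and column indexes of the result are the unions of the operands' row indexes and column indexes respectively.

　　DataFrame methods for handling missing data:
　　　　dropna(axis=0,how='any',…) # axis: 0 drops rows, 1 drops columns; how: 'all' drops only when every value is NaN, 'any' drops when at least one value is NaN
　　　　fillna() # replaces NaN
　　　　isnull() # returns a boolean DataFrame, True where values are missing
　　　　notnull() # the opposite of isnull()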
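
The sketch below, on made-up data, first shows alignment on both axes and then the effect of the axis and how arguments of dropna:

```python
import numpy as np
import pandas as pd

# Alignment: only the cell present in both operands (row 'y', column 'B') gets a value.
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['y', 'z'])
summed = left + right

# Missing data: row 1 is entirely NaN, row 0 has a single NaN.
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
    'C': [7.0, np.nan, 9.0],
})
any_rows = df.dropna()             # drop rows containing any NaN
all_rows = df.dropna(how='all')    # drop rows that are entirely NaN
any_cols = df.dropna(axis=1)       # drop columns containing any NaN
filled = df.fillna(0)
```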

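Common pandas methods (for both Series and DataFrame):
　　mean(axis=0,skipna=False) mean along an axis; skipna defaults to True, and skipna=False makes the result NaN if any value is missing
　　sum(axis=1) sum along each row
　　sort_index(axis, …, ascending) sort by row or column index
　　sort_values(by, axis, ascending) sort by value
　　NumPy universal functions also work on pandas objects

　　apply(func, axis=0) applies a custom function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series
　　applymap(func) applies a function to every element of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map)
　　map(func) applies a function to every element of a Series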
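
A sketch of these methods on a small illustrative DataFrame with one missing value. The element-wise DataFrame variant is left out here because its name differs between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3.0, 1.0, np.nan], 'b': [5.0, 4.0, 6.0]},
                  index=['z', 'x', 'y'])

col_mean = df.mean()                          # per column, NaN skipped by default
strict_mean = df.mean(skipna=False)           # NaN propagates into column 'a'
row_sum = df.sum(axis=1)                      # per row
by_index = df.sort_index()                    # rows ordered by label
by_value = df.sort_values(by='b', ascending=False)
col_range = df.apply(lambda col: col.max() - col.min())   # one scalar per column
squared = df['b'].map(lambda v: v ** 2)       # element-wise on a Series
```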

Hierarchical indexing:
　　Hierarchical indexing is an important pandas feature that lets an axis have multiple index levels.
　　Example: data=pd.Series(np.random.rand(9), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], [1,2,3,1,2,3,1,2,3]])
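
To make the two levels visible, this sketch rebuilds the example with fixed values 0 to 8 in place of random numbers (an assumption made so that the selections can be checked) and selects by the outer label and by a full (outer, inner) pair:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
           [1, 2, 3, 1, 2, 3, 1, 2, 3]],
)

levels = data.index.nlevels      # number of index levels
outer = data['b']                # all entries under outer label 'b'
single = data.loc[('b', 2)]      # one entry, addressed by both levels
```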

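pandas time objects:
　　Kinds of time series data:
　　　　Timestamps: specific instants
　　　　Fixed periods: for example, July 2017
　　　　Intervals: a start time and an end time
　　Python standard library: datetime
　　　　date time datetime timedelta
　　　　dt.strftime()
　　　　strptime()
　　Third-party package: dateutil
　　　　dateutil.parser.parse()
　　Handling dates in bulk: pandas
　　　　pd.to_datetime(['2001-01-01', '2002-02-02'])
　　Generating an array of time objects: date_range
　　　　start start time, string or datetime-like
　　　　end end time, string or datetime-like
　　　　periods number of periods, integer or None
　　　　freq frequency, default 'D'; other options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), … (newer pandas releases deprecate the H, T, S and A spellings in favor of h, min, s and YE)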
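
A short sketch that goes from the standard library to pandas. The dates are illustrative, and only frequency aliases whose spelling has stayed the same across pandas versions (D, B, W) are used:

```python
from datetime import datetime
import pandas as pd

# Standard library: parse a string, then format it again.
stamp = datetime.strptime('2017-07-01', '%Y-%m-%d')
text = stamp.strftime('%Y/%m/%d')

# pandas: convert a whole list of strings at once.
idx = pd.to_datetime(['2001-01-01', '2002-02-02'])

# date_range: give start plus periods, or start plus end.
days = pd.date_range(start='2017-07-01', periods=5)               # freq defaults to 'D'
business = pd.date_range('2017-07-01', '2017-07-10', freq='B')    # weekdays only
weekly = pd.date_range('2017-07-01', periods=3, freq='W')         # weekly, on Sundays
```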

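Reading files with pandas
　　Reading files: load data from a file name, a URL, or a file object
　　　　read_csv default separator is a comma
　　　　read_table default separator is a tab ('\t')
　　　　read_excel reads Excel files
　　Main parameters of the reader functions:
　　　　sep the separator; may be a regular expression such as r'\s+'
　　　　header=None the file has no header row
　　　　names the column names
　　　　index_col the column(s) to use as the index
　　　　skiprows the rows to skip
　　　　na_values strings to be treated as missing values
　　　　parse_dates whether and which columns to parse as dates; a boolean or a list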
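
The parameters above can be combined in a single call. In this sketch the input text, the column names and the 'missing' marker are all made up for illustration; the data is whitespace-separated, has one junk line at the top and no header row.

```python
import io
import pandas as pd

raw = (
    "# exported by a hypothetical tool\n"
    "2020-01-01   10.5   NA\n"
    "2020-01-02   missing   3\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s+',                          # any run of whitespace separates fields
    header=None,                         # the file has no header row
    names=['date', 'close', 'volume'],   # so column names are supplied here
    index_col='date',                    # use the date column as the row index
    skiprows=1,                          # skip the junk first line
    na_values=['missing'],               # extra marker for missing values
    parse_dates=['date'],                # parse the date column into timestamps
)
print(df)
```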

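Writing data with pandas
　　Writing to a file:
　　　　to_csv
　　Main parameters of the writer function:
　　　　sep the separator
　　　　na_rep the string written for missing values; defaults to the empty string
　　　　header=False do not write the row of column names
　　　　index=False do not write the column of row labels
　　　　columns the columns to write, given as a list

　　Other file types: JSON, XML, HTML, databases
　　Saving pandas objects in a binary format (pickle):
　　　　df.to_pickle(path) to save
　　　　pd.read_pickle(path) to load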
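
A sketch of the writer parameters and of a pickle round trip. The data and the file name frame.pkl are illustrative, and the pickle is written to a temporary directory that is removed afterwards:

```python
import os
import tempfile
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})

# No header row, no index column, custom separator and NaN text, columns reordered.
text = df.to_csv(sep=';', na_rep='NULL', header=False, index=False,
                 columns=['b', 'a'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')   # hypothetical file name
    df.to_pickle(path)                      # binary save
    restored = pd.read_pickle(path)         # binary load
```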

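Matplotlib: plotting and visualization

　　Matplotlib is a powerful plotting and data visualization toolkit for Python.

　　Installation: pip install matplotlib
　　Import: import matplotlib.pyplot as plt

　　Plotting function: plt.plot()
　　pandas data can also be plotted directly: df[['ma5', 'ma10']].plot()
　　Showing the figure: plt.show()

The plot function:
　　Line style: linestyle (-, -., --, :)
　　Marker: marker (v, ^, s, *, H, +, x, D, o, …)
　　Color: color (b, g, r, y, k, w, …)
　　Calling plot repeatedly draws multiple curves
　　Title: title
　　x-axis label: xlabel
　　y-axis label: ylabel

Other plot types:
　　hist frequency histogram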
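
A minimal sketch of the plot options listed above, with made-up data. The Agg backend is selected only so that the snippet also runs without a display; in an interactive session that line can be dropped and plt.show() called at the end.

```python
import matplotlib
matplotlib.use('Agg')            # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

plt.figure()
# Two calls to plot draw two curves on the same axes.
line_a, = plt.plot(x, [0, 1, 4, 9], color='r', linestyle='--', marker='o')
line_b, = plt.plot(x, [0, 1, 2, 3], color='b', linestyle=':', marker='x')
plt.title('two curves')
plt.xlabel('x')
plt.ylabel('y')

ax = plt.gca()                   # the axes that the calls above drew on
```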

Matplotlib: figures and subplots

　　Figure (canvas): figure
　　　　fig = plt.figure()
　　Subplot: subplot
　　　　ax1 = fig.add_subplot(2,2,1)
　　Adjusting the spacing between subplots:
　　　　subplots_adjust(left, bottom, right, top, wspace, hspace)
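
A sketch of one figure holding two subplots of a 2x2 grid, using illustrative data. It uses the axes-level methods (ax.plot, ax.hist, ax.set_title), which are the per-subplot counterparts of the plt functions; the Agg backend is again an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')            # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig = plt.figure()

ax1 = fig.add_subplot(2, 2, 1)   # 2x2 grid, first cell
ax1.plot(x, np.sin(x), color='g')
ax1.set_title('sin')

ax2 = fig.add_subplot(2, 2, 2)   # 2x2 grid, second cell
counts, edges, _ = ax2.hist(np.sin(x), bins=10)   # frequency histogram

fig.subplots_adjust(wspace=0.4, hspace=0.4)       # widen the gaps between subplots
```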

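Finding golden crosses and death crosses from the 5-day and 10-day moving averages

The data source is stock price data for a period of time, downloaded from the web.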

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('601318.csv', index_col='date', parse_dates=['date'])

df['ma5'] = np.nan
df['ma10'] = np.nan

# Step 1: compute the moving averages
# Option 1: an explicit loop, which is very slow
# for i in range(4, len(df)):
#     df.loc[df.index[i], 'ma5'] = df['close'][i-4:i+1].mean()
# for i in range(9, len(df)):
#     df.loc[df.index[i], 'ma10'] = df['close'][i-9:i+1].mean()

# Option 2: cumsum (subtract the cumulative sum from 5 rows earlier to get the sum of the last 5 closes)

# close     =  [10, 11, 12, 13, 14, 15, 16]
# close.cumsum=[10, 21, 33, 46, 60, 75, 91]
#                                -   -   -
#               [nan,nan,nan,nan,0,  10, 21, 33, 46, 60, 75, 91]

# sr = df['close'].cumsum()
# df['ma5'] = (sr - sr.shift(1).fillna(0).shift(4))/5
# df['ma10'] = (sr - sr.shift(1).fillna(0).shift(9))/10

# Option 3: rolling

df['ma5'] = df['close'].rolling(5).mean()
df['ma10'] = df['close'].rolling(10).mean()

df = df.dropna()

df[['ma5', 'ma10']].plot()
plt.show()
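# Step 2: find the golden crosses and death crosses
# Option 1: loop
# Golden cross: previous day short <= long, current day short > long
# Death cross: previous day short >= long, current day short < long
# sr = df['ma5'] <= df['ma10']
#
# golden_cross = []
# death_cross = []
# for i in range(1, len(sr)):
#     # if sr.iloc[i] == True and sr.iloc[i + 1] == False: my first idea was to look at i + 1, but that runs past the end of the index
#     if sr.iloc[i - 1] == True and sr.iloc[i] == False:
#         golden_cross.append(sr.index[i])
#     if sr.iloc[i - 1] == False and sr.iloc[i] == True:
#         death_cross.append(sr.index[i])

# Option 2: vectorized with shift
# shift(1) moves the previous day's comparison onto the current row;
# fill_value=False keeps the shifted Series boolean and excludes the first row.
golden_cross = df[(df['ma5'] > df['ma10']) & (df['ma5'] <= df['ma10']).shift(1, fill_value=False)].index
death_cross = df[(df['ma5'] < df['ma10']) & (df['ma5'] >= df['ma10']).shift(1, fill_value=False)].index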
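
Because the stock file itself is not included here, the crossover rule can be checked on a tiny hand-made table instead. This is a sketch under stated assumptions: the demo values are invented, with the short average rising above the long one at position 2 and falling back below it at position 4.

```python
import pandas as pd

# Hypothetical moving averages, not real stock data.
demo = pd.DataFrame({
    'ma5':  [1.0, 2.0, 4.0, 5.0, 2.0, 1.0],
    'ma10': [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
})

above = demo['ma5'] > demo['ma10']     # short above long on the current row
below = demo['ma5'] < demo['ma10']     # short below long on the current row

# Previous row's state, moved down by one; the first row has no previous row.
was_not_above = (demo['ma5'] <= demo['ma10']).shift(1, fill_value=False)
was_not_below = (demo['ma5'] >= demo['ma10']).shift(1, fill_value=False)

golden = demo[above & was_not_above].index.tolist()   # upward crossings
death = demo[below & was_not_below].index.tolist()    # downward crossings

print(golden, death)
```

Shifting the comparison rather than the prices keeps each condition a single boolean Series, which mirrors the loop version's sr.iloc[i - 1] and sr.iloc[i] test without any explicit iteration.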