Quantitative Financial Analysis with Python

IPython: an interactive Python command line
Code can be pasted straight into the command line

Install: pip install ipython

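TAB key: auto-completion
?: introspection and namespace search (a.a*? lists the names matching the pattern, a? shows details of a variable, func?? shows the source of a function)
!: runs a system shell command (!ipconfig)
%run: executes the code in a file (%run test.py)
%paste, %cpaste: run code taken from the clipboard
%timeit: measures how long a statement takes to run, e.g. %timeit func(a,b)
%pdb: turns automatic debugging on or off (%pdb on / %pdb off). When on, an exception drops you into the debugger at the line that failed instead of returning to the prompt; p a prints the variable a
History: _, __, _2, _i2  # for example, after running a+b and then a*b, _ holds the result of a*b and __ the result of a+b; the 2 in _2 is a prompt number, so _2 is the output of Out[2] and _i2 is the code typed at In[2]
%bookmark: directory bookmark system
IPython Notebook: pip install jupyter  # a browser-based code editor; start it with: jupyter notebook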


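NumPy: the numerical computing module
NumPy is the foundational package for high-performance scientific computing and data analysis; pandas and many other tools are built on top of it
Main features of NumPy:
ndarray, a multidimensional array structure that is fast and space-efficient
Mathematical functions that operate on whole arrays quickly, without writing loops
Tools for reading and writing array data on disk and for working with memory-mapped files
Linear algebra, random number generation, and Fourier transform capabilities
Tools for integrating code written in C, C++, and other languages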
pip install numpy
import numpy as np

NumPy methods

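Creating an ndarray: np.array()
Why use ndarray:
    Example 1: given the market value (in US dollars) of several multinational companies, convert the values to RMB
    Example 2: given the price and quantity of every item in a shopping cart, compute the total amount
An ndarray can also be multidimensional, but all of its elements must have the same type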
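Example 1 above can be sketched in a few lines. The market values and the exchange rate here are made up for illustration only.

```python
import numpy as np

# Hypothetical market values, in billions of US dollars
market_cap_usd = np.array([120.0, 85.5, 300.0])
# Assumed exchange rate, for illustration only
usd_to_rmb = 7.0

# One vectorized multiplication converts every value; no loop is needed
market_cap_rmb = market_cap_usd * usd_to_rmb
print(market_cap_rmb)
```

With a plain list the same conversion would need a loop or a list comprehension.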
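Common attributes:
    T      the transpose of the array (meaningful for arrays with two or more dimensions)
    dtype  the data type of the elements. To convert to another type use a.astype('int64'); assigning a.dtype='int64' only reinterprets the underlying bytes
    size   the number of elements
    ndim   the number of dimensions
    shape  the size of each dimension (as a tuple)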
    
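Example:
import sys
import random
import numpy as np
sys.getsizeof(a)  # memory used by an object a; for large numeric data an ndarray is much smaller than the equivalent list
prize=[round(random.uniform(10.0,20.0),2) for i in range(20)]
prize_np=np.array(prize)

num=[random.randint(1,10) for i in range(20)]
num_np=np.array(num)
In [100]: np.dot(prize_np,num_np)   # sum of the element-wise products of the two arrays
Out[100]: 1444.09

prize_np*num_np   # element-wise product of the two arrays
In [102]: _.sum()
Out[102]: 1444.0899999999997  # same value as dot, up to floating-point rounding

prize_np * 2   # multiplies every element by 2; + - * / all work the same way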


In [104]: z=np.array([[1,2,3],[4,5,6]])
In [105]: z
Out[105]:
array([[1, 2, 3],
       [4, 5, 6]])
In [106]: z.size
Out[106]: 6
In [107]: z.shape
Out[107]: (2, 3)

In [108]: z.T   # rows become columns and columns become rows
Out[108]:
array([[1, 4],
       [2, 5],
       [3, 6]])

In [112]: z=z.astype('float32') # convert the element type
In [113]: z
Out[113]:
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

np.array(a,dtype='float')  # the dtype can also be set when the array is created

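dtype:
    bool, int(8,16,32,64), uint(8,16,32,64), float(16,32,64)
Type conversion: astype()
array()    converts a list to an array; the dtype can optionally be given explicitly
arange()   the NumPy version of range; supports floating-point steps
linspace() similar to arange(), but the third argument is the number of elements
zeros()    creates an all-zeros array with the given shape and dtype
ones()     creates an all-ones array with the given shape and dtype
empty()    creates an uninitialized array with the given shape and dtype (contents are arbitrary)
eye()      creates an identity matrix with the given side length and dtype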

Example:
    
In [120]: np.arange(1,10,0.2) # the step can be a fraction
Out[120]:
array([1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. , 3.2, 3.4,
       3.6, 3.8, 4. , 4.2, 4.4, 4.6, 4.8, 5. , 5.2, 5.4, 5.6, 5.8, 6. ,
       6.2, 6.4, 6.6, 6.8, 7. , 7.2, 7.4, 7.6, 7.8, 8. , 8.2, 8.4, 8.6,
       8.8, 9. , 9.2, 9.4, 9.6, 9.8])
       
In [126]: np.linspace(1,10,15)  # 15 evenly spaced numbers from 1 to 10, both endpoints included
Out[126]:
array([ 1.        ,  1.64285714,  2.28571429,  2.92857143,  3.57142857,
        4.21428571,  4.85714286,  5.5       ,  6.14285714,  6.78571429,
        7.42857143,  8.07142857,  8.71428571,  9.35714286, 10.        ])

In [127]: np.zeros(10)
Out[127]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [128]: np.zeros(10,dtype='int')  # specify the dtype
Out[128]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

np.zeros((3,5,10))  # creates a three-dimensional array (output omitted)

In [129]: np.zeros((3,5))  # a two-dimensional array filled with zeros
Out[129]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
       
In [141]: np.ones(10)      # an array filled with ones
Out[141]: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [151]: np.empty(10)     # an uninitialized array: the values are whatever was left in memory. Called the same way as zeros
Out[151]: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [156]: np.arange(10)
Out[156]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [157]: a=np.arange(10)
In [158]: a.reshape((2,5))
Out[158]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [159]: b=a.reshape((2,5))   # reshape into a two-dimensional array
In [160]: b.ndim
Out[160]: 2

In [161]: np.eye(5)  # identity matrix with the given side length (and optional dtype)
Out[161]:
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [162]: np.eye(2)
Out[162]:
array([[1., 0.],
       [0., 1.]])
       
In [168]: a=np.arange(15).reshape(3,5) # build a two-dimensional array
In [169]: a
Out[169]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
In [170]: a.ndim
Out[170]: 2


NumPy indexing and slicing

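Operations between an array and a scalar
* // + -
Operations between arrays of the same shape
+ / **
Array indexing
a[5],a2[2][3],a2[2,3]  # the row index comes before the comma, the column index after it
Array slicing
a[5:8],a[:3]=1,a2[1:2,:4],a2[:,:1],a2[:,1]
Unlike lists, array slices are not copied automatically: a slice is a view, so changes made through the slice also change the original array [fix: copy()]
b=a[:4],b=a[:4].copy(),b[-1]=200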
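The difference between a view and a copy can be checked directly. This is a minimal sketch; the variable names are chosen for illustration.

```python
import numpy as np

a = np.arange(10)
view = a[:4]            # a slice is a view onto the same memory
view[-1] = 200          # this also changes a[3]
changed_through_view = a[3]

a = np.arange(10)       # start again from a fresh array
safe = a[:4].copy()     # an independent copy
safe[-1] = 200          # the original array is untouched
unchanged_after_copy = a[3]

print(changed_through_view, unchanged_after_copy)
```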

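Boolean indexing
Problem 1: given an array, select all numbers greater than 5.
a[a>5]
Problem 2: given an array, select all even numbers greater than 5
a[(a>5) & (a%2==0)]
Problem 3: given an array, select all numbers that are greater than 5 or even
a[(a>5) | (a%2==0)]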
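The three selections can be tried on a small array; the sample values are arbitrary. The parentheses are required because & and | bind more tightly than the comparison operators.

```python
import numpy as np

a = np.array([3, 8, 6, 1, 7, 10, 2])

greater = a[a > 5]                             # [ 8  6  7 10]
greater_and_even = a[(a > 5) & (a % 2 == 0)]   # [ 8  6 10]
greater_or_even = a[(a > 5) | (a % 2 == 0)]    # [ 8  6  7 10  2]

print(greater, greater_and_even, greater_or_even)
```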

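Fancy indexing
Problem 1: given an array, select the elements at positions 1, 3, 4, 6, and 7 (0-based) to form a new array
a[[1,3,4,6,7]]

Problem 2: given a two-dimensional array, select the columns at indices 1 and 3 (0-based) to form a new two-dimensional array
a[:,[1,3]]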
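A short sketch of both cases, using arrays built with arange so the selected values are easy to recognise:

```python
import numpy as np

a = np.arange(10) * 10
picked = a[[1, 3, 4, 6, 7]]       # [10 30 40 60 70]

m = np.arange(20).reshape(4, 5)
cols = m[:, [1, 3]]               # columns at indices 1 and 3, shape (4, 2)

print(picked)
print(cols)
```

Unlike slicing, fancy indexing returns a copy, so changing picked or cols does not affect the original arrays.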

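Universal functions: functions that operate on every element of an array at once
Common universal functions
    Unary: abs, sqrt, exp, log, ceil, floor, rint, trunc, modf, isnan, isinf (isnan and isinf return boolean arrays), cos, sin, tan
    Binary: add, subtract, multiply, divide, power, mod, maximum, minimum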
    
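Example:
    In [1]: import math
    In [3]: math.ceil(3.5) # round up to the next integer
    Out[3]: 4
    
    In [5]: math.floor(3.5)# round down to the previous integer
    Out[5]: 3
    
    In [6]: math.trunc(3.1)# drop the fractional part
    Out[6]: 3
    
    nan denotes a missing (not-a-number) value; it appears, for example, when 0/0 is computed in array arithmetic
    In [29]: c[~np.isnan(c)]  # keep only the values that are not nan
    Out[29]: array([1., 1., 1., 1.])
    
    inf denotes infinity; it appears, for example, when 5/0 is computed in array arithmetic
    
    In [34]: np.maximum(a,b)   # compares position by position and keeps the larger value; minimum does the opposite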
    Out[34]: array([5, 4, 3, 4, 0])
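How nan and inf arise, and how to filter them out, can be sketched as follows. np.errstate is used here only to silence the division warnings; the input values are arbitrary.

```python
import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    # 0/0 gives nan, 5/0 gives inf, 4/2 gives 2.0
    c = np.array([0.0, 5.0, 4.0]) / np.array([0.0, 0.0, 2.0])

not_nan = c[~np.isnan(c)]    # drops nan but keeps inf
finite = c[np.isfinite(c)]   # drops both nan and inf

print(c, not_nan, finite)
```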

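Mathematical and statistical methods
    Common functions
        sum: sum, cumsum: cumulative sum, mean: average, std: standard deviation, var: variance
        min: minimum, max: maximum, argmin: index of the minimum, argmax: index of the maximum
Example:
    In [35]: np.cumsum(b) # running total: each element plus all the elements before it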
    Out[35]: array([ 5,  9, 12, 14, 14], dtype=int32)
    
    In [36]: np.mean(b)   # the mean
    Out[36]: 2.8
    
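Random number generation
    rand    random array of the given shape (values in [0, 1))
    randint random integers of the given shape
    choice  random selection from a given array, of the given shape
    shuffle same as random.shuffle (shuffles in place)
    uniform random array of the given shape, drawn uniformly from a given range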

Example:
    In [88]: np.random.rand()*9+1  # a random float in [1, 10)
    Out[88]: 7.197970422073758

    In [69]: np.random.randint(1,15,(3,5)) # a two-dimensional array of random integers from 1 to 14
    Out[69]:
    array([[ 1,  7,  1,  8,  2],
           [ 3,  2, 11,  9,  4],
           [ 9, 13,  8, 14,  9]])
           
    In [110]: np.random.shuffle(b)
    In [111]: b
    Out[111]: array([4, 1, 3, 5, 2])
    
    In [125]: np.random.uniform(1,10,2)  # two random floats between 1 and 10
    Out[125]: array([3.77433418, 5.27087254])
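The functions listed above can be exercised together. Fixing the seed (the value 42 is arbitrary) makes a run repeatable; no outputs are shown because they depend on the NumPy version and seed.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed so that repeated runs give the same numbers

floats = np.random.rand(2, 3)              # shape (2, 3), values in [0, 1)
ints = np.random.randint(1, 15, (3, 5))    # integers from 1 to 14 inclusive
picks = np.random.choice([10, 20, 30], 4)  # 4 draws, with replacement
deck = np.arange(5)
np.random.shuffle(deck)                    # shuffles in place and returns None
scaled = np.random.uniform(1, 10, 2)       # 2 floats in [1, 10)

print(floats.shape, ints.shape, picks, deck, scaled)
```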


pandas: Series

pip install pandas
import pandas as pd
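A Series is a one-dimensional array-like object made up of a set of data and an associated set of data labels (the index)
A Series behaves like a cross between a list (array) and a dictionary
Ways to create one
    pd.Series([1,2,4,2])
    pd.Series([1,2,4,2],index=['a','b','c','d'])
    pd.Series({'a':1,'b':2})
    pd.Series(0,index=['a','b','c','d'])  # same as pd.Series(0,index=list('abcd'))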
    
Getting the value array and the index array: the values and index attributes

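Series features
A Series supports NumPy-style (positional) features:
Create a Series from an ndarray: Series(arr)
Arithmetic with a scalar:  a*2
Arithmetic between two Series:  a+b
Indexing: a[0],a[[1,2,4]]  (with a non-integer index, recent pandas versions prefer a.iloc[0] for positional access)
Slicing: a[0:2]
Universal functions: np.abs(a)
Boolean filtering: a[a>0]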

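A Series also supports dictionary-style (label) features:
Create a Series from a dictionary: Series(dict)
in operator: 'a' in a  # 'a' is a label
Key indexing: a['a'],a[['a','b','c']]
a.get('aaa',default=0)  # the value at label aaa, or 0 if that label does not exist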
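Both sides of a Series, array-like and dictionary-like, can be seen on one small object. The values and labels here are arbitrary.

```python
import pandas as pd

s = pd.Series([1, -2, 4, -2], index=['a', 'b', 'c', 'd'])

# Array-like behaviour
doubled = s * 2              # arithmetic with a scalar
positive = s[s > 0]          # boolean filtering keeps labels a and c
first = s.iloc[0]            # positional access

# Dictionary-like behaviour
has_a = 'a' in s             # membership tests the labels, not the values
by_label = s['c']
fallback = s.get('zzz', default=0)   # label is absent, so the default is returned

print(positive.index.tolist(), first, has_a, by_label, fallback)
```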

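pandas objects with an integer index
If the index is of integer type, indexing with an integer is always interpreted as label-based
Example: a=pd.Series(np.arange(20))
      b=a[10:].copy()
      b[-1]  # raises KeyError, because there is no label -1
Fix
    b.loc[10]  # interpreted as a label
    b.iloc[-1] # interpreted as a position
a.values * b.values  # works on the raw arrays and ignores the index; the two arrays must have the same number of elements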
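The label-versus-position pitfall can be reproduced in a few lines, following the example above:

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(20))
b = a[10:].copy()        # labels are 10..19, positions are 0..9

try:
    b[-1]                # looked up as a label, and there is no label -1
    raised = False
except KeyError:
    raised = True

by_label = b.loc[10]     # first element, found by its label
by_position = b.iloc[-1] # last element, found by its position

print(raised, by_label, by_position)
```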

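Series data alignment
pandas aligns the operands by index label before computing; if the indexes differ, the index of the result is the union of the two operand indexes
Example: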
    In [242]: a=pd.Series(np.arange(5),index=list('abcde'))

    In [243]: b=pd.Series(np.arange(5),index=list('cebad'))

    In [245]: a
    Out[245]:
    a    0
    b    1
    c    2
    d    3
    e    4


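How can missing values be treated as 0 when two Series are added?
    a.add(b,fill_value=0)
Flexible arithmetic methods: add, sub, div, mul
Example (b as printed here has no label 'a', so a+b produces NaN at 'a'):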
    In [249]: b
    Out[249]:
    c    0
    e    1
    b    2
    d    4
    dtype: int32

    In [250]: a
    Out[250]:
    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int32

    In [251]: a+b
    Out[251]:
    a    NaN
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

    In [253]: a.add(b,fill_value=0)
    Out[253]:
    a    0.0
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

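Missing data in a Series: NaN represents missing data, and its value equals np.nan. The built-in None is also treated as NaN
Methods for handling missing data:
  dropna()  drops the entries whose value is NaN
  fillna()  fills in missing values
  isnull()  returns a boolean array, True where a value is missing
  notnull() returns a boolean array, False where a value is missing
  Filtering out missing data: a.dropna() or a[a.notnull()]
  Filling missing data: fillna(0)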

Example:
    In [255]: a+b
    Out[255]:
    a    NaN
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

    In [256]: c=_

    In [257]: c.dropna()
    Out[257]:
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

    In [258]: c.fillna(0)
    Out[258]:
    a    0.0
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

    In [259]: c.isnull()
    Out[259]:
    a     True
    b    False
    c    False
    d    False
    e    False
    dtype: bool

    In [260]: c[c.isnull()]
    Out[260]:
    a   NaN
    dtype: float64

    In [261]: c[~c.isnull()]
    Out[261]:
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64

    In [262]: c[c.notnull()]
    Out[262]:
    b    3.0
    c    2.0
    d    7.0
    e    5.0
    dtype: float64


pandas: DataFrame

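A DataFrame is a tabular data structure that contains an ordered collection of columns
A DataFrame can be viewed as a dictionary of Series that all share one index
Ways to create one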
pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
pd.DataFrame({'one':pd.Series([1,2,3,4],index=['a','b','c','d']),'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
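The second form is worth running once, because the two Series list their labels in a different order. A minimal sketch of what happens:

```python
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd']),
})

# Values follow their labels, not their position in the list:
# row 'a' gets 1 from column 'one' and 2 from column 'two'
print(df)
```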
Reading and writing CSV files
df=pd.read_csv('test.csv')   # read a CSV file into a DataFrame
df.to_csv('test2.csv')       # write the DataFrame to a CSV file

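Viewing the data in a DataFrame
Common attributes and methods for viewing data
index        the row index
T            the transpose
columns      the column index
values       the value array
describe     quick summary statistics
name attribute of each DataFrame column: the column name
rename(columns={...}) # e.g. df.rename(columns={'close':'new_close','open':'new_open'}) # rename columns
df.index.name='iid' # set the name of the index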

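DataFrame indexing and slicing
A DataFrame has a row index and a column index
Access by label:
   df['close']          # one column
   df['close'][0]       # a single value
   df[['close','open']] # two columns, returned as a DataFrame
   df[0:10][['close','open']] # the two columns for the first ten rows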
   
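Access by position:
   df.iloc[3]   # the row at position 3 (the fourth row)
   df.iloc[3,3] # the value at row position 3, column position 3
   df.iloc[0:3,4:6] # the first 3 rows, columns at positions 4 and 5
   df.iloc[1:5,:]   # rows at positions 1 through 4, all columns
   df.iloc[[1,2,4],[0,3]] # rows at positions 1, 2, 4 and columns at positions 0 and 3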

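Filtering with booleans:
   df[df['close']>20] # rows where the close column is greater than 20
   df[df['date'].isin(['2007/3/1','2007/3/4'])]  # rows whose date is 2007/3/1 or 2007/3/4 (isin tests membership, not a range)
   df[df['id'].isin([1,3,5,7])]  # rows whose id is 1, 3, 5, or 7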
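The selections above can be tried on a small made-up price table. The column names follow the notes; the numbers are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'date': ['2007/3/1', '2007/3/2', '2007/3/3', '2007/3/4', '2007/3/5'],
    'open': [19.5, 20.1, 20.8, 21.0, 19.9],
    'close': [20.0, 20.5, 21.2, 19.8, 20.3],
})

sub = df[['close', 'open']]           # two columns, still a DataFrame
cell = df.iloc[3, 3]                  # row position 3, column position 3
block = df.iloc[[1, 2, 4], [0, 3]]    # rows 1, 2, 4 and columns 0, 3
high = df[df['close'] > 20]           # rows where close is greater than 20
two_days = df[df['date'].isin(['2007/3/1', '2007/3/4'])]  # exactly those two dates

print(cell, len(high), len(two_days))
```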
   
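DataFrame data alignment and missing data
DataFrame objects are also aligned during arithmetic; the row index and column index of the result are the unions of the operands' row indexes and column indexes

DataFrame methods for handling missing data
dropna(axis=0,how='any')  # axis: 0 (the default) drops rows, 1 drops columns. how: 'any' (the default) drops when any value is NaN, 'all' drops only when every value is NaN
fillna(), isnull(), notnull()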
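The effect of axis and how is easiest to see on a tiny frame with a few holes in it. The data below is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'open': [1.0, np.nan, 3.0],
    'close': [np.nan, np.nan, 6.0],
})

rows_any = df.dropna()            # drops rows 0 and 1, each has at least one NaN
rows_all = df.dropna(how='all')   # drops only row 1, where every value is NaN
cols_any = df.dropna(axis=1)      # both columns contain a NaN, so none remain
filled = df.fillna(0)             # replaces every NaN with 0

print(rows_any.index.tolist(), rows_all.index.tolist(), cols_any.shape)
```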

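Other common pandas methods (for both Series and DataFrame)
mean(axis=0,skipna=False)
sum(axis=1)
sort_index(axis,...,ascending) sorts by the row or column index, e.g. df.sort_index(ascending=True,axis=1) sorts by column label in ascending order, while df.sort_index(ascending=False,axis=0) sorts by row label in descending order
sort_values(by,axis,ascending) sorts by value, e.g. df.sort_values('close',ascending=True) sorts the rows by the close column from smallest to largest
NumPy universal functions also work on pandas objects
apply(func,axis=0) applies a custom function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series. df2.apply(lambda x:x.sum()) gives the sum of each column, the same as df2.sum()
applymap(func) applies a function to every element of a DataFrame, e.g. df2.applymap(lambda x:x+1) adds 1 to every value (pandas 2.1 and later rename this method to DataFrame.map)
map(func)  applies a function to every element of a Series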
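A short sketch of sorting and apply on a made-up frame. applymap is left out here because its name differs between pandas versions.

```python
import pandas as pd

df2 = pd.DataFrame({'open': [3, 1, 2], 'close': [9, 7, 8]}, index=['c', 'a', 'b'])

by_close = df2.sort_values('close', ascending=True)    # close goes 7, 8, 9
by_label = df2.sort_index(axis=0)                      # rows in the order a, b, c
col_sums = df2.apply(lambda col: col.sum())            # one value per column
row_sums = df2.apply(lambda row: row.sum(), axis=1)    # one value per row
plus_one = df2['open'].map(lambda x: x + 1)            # element-wise on a Series

print(col_sums.to_dict(), row_sums.to_dict())
```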

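Hierarchical indexing in pandas
Hierarchical indexing is an important pandas feature: it lets a single axis carry several index levels
Example:
  data=pd.Series(np.random.rand(9),index=[['a','a','a','b','b','b','c','c','c'],[1,2,3,1,2,3,1,2,3]])
  data['a'][1]  # index through both levels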
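The same two-level Series, built here with np.arange instead of random values so the results are predictable, shows a few ways to select from it:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
           [1, 2, 3, 1, 2, 3, 1, 2, 3]],
)

outer = data['a']            # the inner Series with labels 1, 2, 3
value = data.loc[('a', 1)]   # both levels given at once
second = data.loc[:, 2]      # inner label 2 under every outer label

print(outer.tolist(), value, second.tolist())
```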
 
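Reading from files with pandas
Reading files: data can be loaded from a file name, a URL, or a file object
read_csv   the default separator is a comma
read_table the default separator is a tab (\t)
Main parameters of the reading functions:
  sep         the separator; a regular expression such as "\s+" is allowed
  header=None states that the file has no header row
  names       the column names
  index_col   the column to use as the index
  skiprows    the rows to skip
  na_values   strings to treat as missing values. Empty fields and the usual markers are already read as NaN by default, so this can often be omitted
  parse_dates whether to parse certain columns as dates; a boolean or a list
  nrows       the number of rows to read
  chunksize   read the file in chunks of the given size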
  
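Example:
   
     df=pd.read_table('test.csv', # read the file
     sep=',',  # split on commas; without it read_table splits on tabs and each line is read as a single field
     header=None, # the file has no header row
     names=['id','date','open','close','high','low','valume','code'], # set the column names
     skiprows=[1,2,3,4], # skip the lines numbered 1 through 4 (0-based)
     index_col='date',   # use the date column as the index
     parse_dates=['date']) # parse date into datetime objects; a boolean is also accepted

type(df.index[0])  # check the type of the index values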
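The same parameters can be tried without a file on disk by reading from an in-memory buffer. The contents below are invented and use only four of the columns.

```python
import io
import pandas as pd

# Made-up file contents with no header row; the last close value is missing
raw = io.StringIO(
    "1,2007/3/1,19.5,20.0\n"
    "2,2007/3/2,20.1,20.5\n"
    "3,2007/3/3,20.8,\n"
)

df = pd.read_csv(
    raw,
    header=None,                            # the data has no header row
    names=['id', 'date', 'open', 'close'],  # so column names are supplied here
    index_col='date',                       # use the date column as the index
    parse_dates=['date'],                   # and parse it into timestamps
)

print(type(df.index[0]))
print(df['close'].isnull().sum())
```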
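Writing to files with pandas
Writing to a file
to_csv
Main parameters of the writing function
sep
na_rep       the string written for missing values; the default is an empty string
header=False do not write the row of column names
index=False  do not write the index column  # this avoids an extra column appearing when the file is read back
columns      the columns to write, passed as a list
Other file types: json/XML/HTML/databases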
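A minimal sketch of these to_csv parameters, writing to an in-memory buffer instead of a file. The 'NULL' marker is an arbitrary choice for illustration.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [19.5, 20.1], 'close': [20.0, np.nan]})

buffer = io.StringIO()
df.to_csv(
    buffer,
    index=False,                 # leave out the index column
    na_rep='NULL',               # write missing values as the text NULL
    columns=['close', 'open'],   # choose and order the columns to write
)
lines = buffer.getvalue().splitlines()
print(lines)
```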

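Handling time objects in pandas
Types of time series:
    Timestamp: a specific moment
    Fixed period: e.g. July 2017
    Time interval: a start time to an end time
  
Python standard library: datetime
    date time  datetime timedelta
    strftime()
    strptime()

Third-party package: dateutil
    dateutil.parser.parse("2017-01-01"). If the day is missing, it is filled in from the current date

Processing dates in bulk: pandas
    pd.to_datetime(['2017-01-01','2017-01-02'])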
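The standard-library calls and pd.to_datetime can be combined as in this short sketch; the dates and formats are arbitrary examples.

```python
from datetime import datetime, timedelta
import pandas as pd

d = datetime.strptime('2017-01-01', '%Y-%m-%d')   # string to datetime
later = d + timedelta(days=30)                    # date arithmetic
label = later.strftime('%Y/%m/%d')                # datetime back to a string

idx = pd.to_datetime(['2017-01-01', '2017-01-02'])  # a whole list at once

print(label, type(idx).__name__)
```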

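Generating an array of time objects: date_range
start    start time
end      end time
periods  number of periods
freq     frequency; the default is D (day). Options include
H(our), W(eek), B(usiness day), S(emi-)M(onth), M(onth), T (minutes), S(econd), A (year); pandas 2.2 and later prefer the aliases h, min, s, ME, and YE

pd.date_range("2017-01-01",'2017-02-01',freq='B')  # business days from Jan 1 to Feb 1. SM is semi-monthly, Y is yearly, W-MON is every Monday
pd.date_range("2017-01-01",periods=100)  # 100 consecutive days
df['2007']  # the data from 2007 (works on a Series; for a DataFrame in recent pandas use df.loc['2007'])
df['2007-03-01':'2007-03-08'] # the data in a date range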

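A time series is a Series or DataFrame whose index is made of time objects
When datetime objects are used as an index, they are stored in a DatetimeIndex object
Special features of time series
Slicing by passing a year or a year and month
Slicing by passing a date range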
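Both slicing styles can be sketched on a generated daily series. The values are just a counter, and .loc is used so the same code works for a Series and a DataFrame.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2017-01-01', periods=100)                   # 100 consecutive days
workdays = pd.date_range('2017-01-01', '2017-02-01', freq='B')    # business days only

ts = pd.Series(np.arange(100), index=days)

january = ts.loc['2017-01']                   # every entry in January 2017
week = ts.loc['2017-03-01':'2017-03-08']      # a date range, both ends included

print(len(workdays), len(january), len(week))
```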
