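Data Analysis with NumPy

1 Introduction

1.1 Python is an excellent language

All the major open-source tools have a Python interface (NumPy, Pandas, Matplotlib, scikit-learn, TensorFlow).
The most effective way to learn them is to look things up in each tool's API documentation.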

1.2 Machine learning

It can be broken down into the following parts:
- Algorithms
- Data
- Programs
- Evaluation
- Applications

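1.3 Related fields that are currently popular

Data mining (where the value lies)
Pattern recognition
Statistical learning
Computer vision (the last three items in this list are developing the fastest)
Speech recognition
Natural language processing (for example Baidu Translate, where neural networks have replaced traditional dictionary-based translation)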

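1.4 A brief look at deep learning

Deep learning is an extension of neural network algorithms and falls within the scope of artificial intelligence.

1.5 Main steps

Data analysis and preprocessing
Feature selection and model building
Evaluation and prediction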

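2 NumPy usage

ndarray: each row is treated as a sample and each column as a feature (indicator). Operations are applied to the whole matrix at once.

2.1 Basic usage

Inspecting the type and structure of the data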

import numpy

get_num = numpy.genfromtxt('../xls/text_for_numpy.txt', dtype='str', delimiter=',')
print(get_num)
'''
[['001' '厄运莎拉' '2001A']
 ['002' 'XML' '2002A']
 ['003' 'wei' '2003A']
 ['004' '123' '2004A']
 ['005' 'kkk' '2005A']]
'''
print(type(get_num))
# <class 'numpy.ndarray'>
# every NumPy array is an instance of ndarray
print(numpy.array([get_num]))

num = [[1, 2, 3], [2, 3, 4]]
a_num = numpy.array(num)
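# all elements of a numpy.array share one dtype, and every row must have the same length
print(a_num)            # the list converted to an ndarray
# [[1 2 3]
#  [2 3 4]]
print(a_num.shape)      # inspect the shape
# (2, 3)
print(a_num[0, 1])      # 2: index 0 selects the first sample (row), index 1 the second feature (column)
print(a_num[0, :])      # first row
print(a_num[:, 0])      # first column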

print(a_num == 5)
'''
Boolean array: the comparison is applied to every element
[[False False False]
 [False False False]]
'''
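
The remark that all elements of an array share one type can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original notes. It shows how NumPy silently promotes mixed input to a single common dtype.

```python
import numpy

# Integers mixed with a float: every element is promoted to float64.
ints_and_floats = numpy.array([1, 2.5, 3])
print(ints_and_floats.dtype)            # float64

# Numbers mixed with a string: every element becomes a string.
ints_and_strings = numpy.array([1, 2, 'three'])
print(ints_and_strings.dtype.kind)      # U (unicode string)
print(ints_and_strings[0] + ints_and_strings[1])    # 12 (string concatenation, not addition)
```

This is also why the file in the first example is read with dtype='str': one column contains text, so the whole array has to hold strings.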

2.2 Using boolean arrays as indices

2.2.1 Selecting the matching elements

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
index = (num_new == 22)
print(index)
'''
[[False False False False]
 [False False False False]
 [False  True False False]]
'''
print(num_new[index])
'''
[22]    # matching elements are returned; if nothing matches, the result is an empty array []
'''

2.2.2 Selecting the matching samples (rows)

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
index_new = (num_new[:, 1] == 22)
print(num_new[index_new, :])        # [[ 9 22 11 12]]
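
As a small extension of the row selection above, the sketch below shows what happens when no row matches, and how the same kind of mask can be used on the left-hand side of an assignment. The names no_match and capped are illustrative assumptions.

```python
import numpy

num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])

# No row has 100 in its second column: the result has zero rows but keeps 4 columns.
no_match = num_new[num_new[:, 1] == 100, :]
print(no_match.shape)       # (0, 4)

# A boolean mask can also select the elements to overwrite.
capped = num_new.copy()     # work on a copy so num_new stays unchanged
capped[capped > 10] = 10
print(capped[2])            # [ 9 10 10 10]
```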

2.3 Element-wise AND and OR

2.3.1 AND

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
# AND: no element can equal both 5 and 11, so every entry is False
print('equal to 5 and equal to 11', (num_new == 5) & (num_new == 11))
'''
equal to 5 and equal to 11 
[[False False False False]
 [False False False False]
 [False False False False]]
'''

2.3.2 OR

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
# OR: True wherever the element is 5 or 11
print('equal to 5 or equal to 11', (num_new == 5) | (num_new == 11))
'''
equal to 5 or equal to 11 
[[False False False False]
 [ True False False False]
 [False False  True False]]
'''
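
A more typical use of & is a range test, where both conditions can actually hold at once. The sketch below is illustrative only. It also shows that each comparison needs its own parentheses and that Python's keyword and cannot be used on arrays.

```python
import numpy

num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])

# Elements strictly between 5 and 12.
in_range = (num_new > 5) & (num_new < 12)
print(num_new[in_range])        # [ 6  7  8  9 11]

# ~ negates a boolean mask.
print(num_new[~in_range])       # [ 1  2  3  4  5 22 12]

# The keyword `and` asks for a single truth value, which an array does not have.
try:
    (num_new > 5) and (num_new < 12)
    keyword_and_failed = False
except ValueError as error:
    keyword_and_failed = True
    print('ValueError:', error)
```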

2.4 Converting the type of a whole array

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
print(num_new.dtype)
# int32 here; the default integer type is platform-dependent and is often int64
change_type = num_new.astype(float)
print(change_type)
'''
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 22. 11. 12.]]
'''
print(change_type.dtype)
# float64
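
astype also converts between strings and numbers, and from float to int. A minimal sketch with made-up values follows. Note that astype returns a new array, and that float-to-int conversion drops the fractional part instead of rounding.

```python
import numpy

# Strings that look like numbers can be converted to float.
text_numbers = numpy.array(['1', '2.5', '3'])
as_float = text_numbers.astype(float)
print(as_float.dtype)       # float64

# Float to int truncates toward zero: 1.7 -> 1, -1.7 -> -1.
floats = numpy.array([1.7, -1.7, 2.0])
as_int = floats.astype(int)
print(as_int)               # [ 1 -1  2]

# The original array keeps its dtype.
print(floats.dtype)         # float64
```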

2.5 Array computations

2.5.1 Maximum and minimum

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
print(num_new.min())        # 1
print(num_new.max())        # 22
print(num_new[1, :].max())  # 8 (maximum of the second row)

2.5.2 Sum

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
print(sum(num_new[1, :]))   # 26 (sum of the second row)
print(sum(num_new[:, 0]))   # 15 (sum of the first column)

2.5.3 Sum along a given axis

import numpy
num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])
print(num_new.sum(axis=1))  # sum of each row     [10 26 54]
print(num_new.sum(axis=0))  # sum of each column  [15 30 21 24]
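
The axis argument works the same way for other reductions. The sketch below reuses the same matrix to show max and mean per column and per row, and what Python's built-in sum does when it is handed a 2-D array.

```python
import numpy

num_new = numpy.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 22, 11, 12]])

print(num_new.max(axis=0))      # maximum of each column  [ 9 22 11 12]
print(num_new.max(axis=1))      # maximum of each row     [ 4  8 22]
print(num_new.mean(axis=1))     # mean of each row        [ 2.5  6.5 13.5]

# The built-in sum iterates over the rows and adds them up,
# so on a 2-D array it behaves like num_new.sum(axis=0).
builtin_total = sum(num_new)
print(builtin_total)            # [15 30 21 24]
```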

2.6 Creating matrices with arange and reshape

2.6.1 Basic creation

import numpy
print(numpy.arange(18))
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17]
a = numpy.arange(18).reshape(3, 6)
print(a)
'''
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]]
'''
print(a.shape)      # number of rows and columns  (3, 6)
print(a.ndim)       # number of dimensions  2
print(a.dtype)      # element type  int32 (platform-dependent, often int64)
print(a.size)       # total number of elements  18
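
reshape only rearranges the existing elements, so the new shape must account for exactly the same number of values. A short sketch (the name values is illustrative) shows the -1 placeholder and the error raised when the sizes do not match.

```python
import numpy

values = numpy.arange(18)

# -1 lets NumPy work out the missing dimension from the total size.
print(values.reshape(3, -1).shape)      # (3, 6)
print(values.reshape(-1, 9).shape)      # (2, 9)

# 18 elements cannot fill a 4 x 5 matrix.
try:
    values.reshape(4, 5)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print('ValueError:', error)
```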

2.6.2 Initializing a matrix of zeros

print(numpy.zeros((2, 3)))
# the shape must be passed as a tuple (2, 3); numpy.zeros(2, 3) raises an error
'''
[[0. 0. 0.]
 [0. 0. 0.]]
'''

2.6.3 Initializing a matrix of ones

print(numpy.ones((2, 3, 2)))    # a 3-D array of shape (2, 3, 2): two blocks, each with 3 rows and 2 columns
'''
[[[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]]
'''
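
Both constructors return float64 by default, which is why the output above shows 0. and 1. instead of 0 and 1. The sketch below covers a few closely related calls (the dtype argument, numpy.full and numpy.ones_like) and confirms the error mentioned for a shape that is not a tuple.

```python
import numpy

int_zeros = numpy.zeros((2, 3), dtype=int)      # integer zeros instead of floats
sevens = numpy.full((2, 3), 7)                  # any constant fill value
template = numpy.arange(6).reshape(2, 3)
ones_like_template = numpy.ones_like(template)  # same shape and dtype as template

print(int_zeros.dtype.kind)         # i (integer)
print(sevens)
print(ones_like_template.shape)     # (2, 3)

# Without the tuple, the 3 is taken as the dtype argument and rejected.
try:
    numpy.zeros(2, 3)
    zeros_failed = False
except TypeError as error:
    zeros_failed = True
    print('TypeError:', error)
```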

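2.6.4 Building a sequence of odd numbers

arange works like Python's built-in range: arange(start, stop, step), with stop excluded.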

print(numpy.arange(1, 30, 2))
# [ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29]

2.6.5 Creating random matrices and evenly spaced sequences

random

print(numpy.random.random((2, 3)))
'''
[[0.69185679 0.67417911 0.10834083]
 [0.2324281  0.77956343 0.57141579]]
'''
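
The numbers above change on every run. When reproducible output is needed, one option is a seeded generator. This is a sketch using numpy.random.default_rng, which is available in NumPy 1.17 and later; the seed 42 is an arbitrary choice.

```python
import numpy

# Two generators created with the same seed produce the same sequence.
rng_a = numpy.random.default_rng(42)
rng_b = numpy.random.default_rng(42)
sample_a = rng_a.random((2, 3))
sample_b = rng_b.random((2, 3))

print(numpy.array_equal(sample_a, sample_b))        # True
# Like numpy.random.random, the values lie in the half-open interval [0, 1).
print(((sample_a >= 0) & (sample_a < 1)).all())     # True
```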

linspace

print(numpy.linspace(0, 2*numpy.pi, 100))
# numpy.pi is π; this generates 100 evenly spaced values from 0 to 2π, both endpoints included
'''
[0.         0.06346652 0.12693304 0.19039955 0.25386607 0.31733259
 0.38079911 0.44426563 0.50773215 0.57119866 0.63466518 0.6981317
 0.76159822 0.82506474 0.88853126 0.95199777 1.01546429 1.07893081
 1.14239733 1.20586385 1.26933037 1.33279688 1.3962634  1.45972992
 1.52319644 1.58666296 1.65012947 1.71359599 1.77706251 1.84052903
 1.90399555 1.96746207 2.03092858 2.0943951  2.15786162 2.22132814
 2.28479466 2.34826118 2.41172769 2.47519421 2.53866073 2.60212725
 2.66559377 2.72906028 2.7925268  2.85599332 2.91945984 2.98292636
 3.04639288 3.10985939 3.17332591 3.23679243 3.30025895 3.36372547
 3.42719199 3.4906585  3.55412502 3.61759154 3.68105806 3.74452458
 3.8079911  3.87145761 3.93492413 3.99839065 4.06185717 4.12532369
 4.1887902  4.25225672 4.31572324 4.37918976 4.44265628 4.5061228
 4.56958931 4.63305583 4.69652235 4.75998887 4.82345539 4.88692191
 4.95038842 5.01385494 5.07732146 5.14078798 5.2042545  5.26772102
 5.33118753 5.39465405 5.45812057 5.52158709 5.58505361 5.64852012
 5.71198664 5.77545316 5.83891968 5.9023862  5.96585272 6.02931923
 6.09278575 6.15625227 6.21971879 6.28318531]
'''
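
Because linspace includes both endpoints, the spacing is (stop - start) / (num - 1), which is why the step above is 2π/99 ≈ 0.0635 and not 2π/100. A smaller sketch makes the arithmetic easy to verify by hand.

```python
import numpy

# 5 points from 0 to 1 with both ends included: step = (1 - 0) / (5 - 1) = 0.25
points = numpy.linspace(0, 1, 5)
print(points)               # values 0, 0.25, 0.5, 0.75, 1
step = points[1] - points[0]

# endpoint=False leaves out the stop value: step = (1 - 0) / 5 = 0.2
open_points = numpy.linspace(0, 1, 5, endpoint=False)
print(open_points)          # values 0, 0.2, 0.4, 0.6, 0.8
```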

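2.7 Operations between matrices

Arithmetic operators between arrays of the same shape are applied element by element, so v*n is not a matrix product.

2.7.1 Basic operations

import numpy
v = numpy.array([[1, 2, 3, 5], [7, 8, 9, 10]])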
n = numpy.arange(0, 8, 1).reshape(2, 4)
print('v', v)
'''
v 
[[ 1  2  3  5]
 [ 7  8  9 10]]
'''
print('n', n)
'''
n 
[[0 1 2 3]
 [4 5 6 7]]
'''
print('v+n', v+n)
'''
v+n 
[[ 1  3  5  8]
 [11 13 15 17]]
'''
print('v-n', v-n)
'''
v-n 
[[1 1 1 2]
 [3 3 3 3]]
'''
print('v*n', v*n)
'''
v*n 
[[ 0  2  6 15]
 [28 40 54 70]]
'''
print('v*2', v*2)
'''
v*2 
[[ 2  4  6 10]
 [14 16 18 20]]
'''
print(v < 35)
'''
[[ True  True  True  True]
 [ True  True  True  True]]
'''
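
v*2 works even though 2 is not a matrix: NumPy stretches ("broadcasts") the smaller operand to the shape of the larger one. The same rule applies to a single row or a single column, as the following sketch shows; row and col are illustrative names.

```python
import numpy

v = numpy.array([[1, 2, 3, 5], [7, 8, 9, 10]])
row = numpy.array([10, 20, 30, 40])     # shape (4,): added to every row of v
col = numpy.array([[100], [200]])       # shape (2, 1): added to every column of v

print(v + row)
# [[11 22 33 45]
#  [17 28 39 50]]
print(v + col)
# [[101 102 103 105]
#  [207 208 209 210]]
```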

2.7.2 Dot product (matrix multiplication)

k = numpy.array([[1, 2], [3, 4]])
p = numpy.array([[6, 7], [9, 10]])
print(k)
print('------------')
print(p)
print('------------')
print(k*p)
print('------------')
print(k.dot(p))
print('------------')
print(numpy.dot(k, p))
'''
k
[[1 2]
 [3 4]]
------------
p
[[ 6  7]
 [ 9 10]]
------------
k*p
[[ 6 14]
 [27 40]]
------------
k.dot(p)
[[24 27]          # matrix product: [[1*6+2*9, 1*7+2*10],
 [54 61]]         #                  [3*6+4*9, 3*7+4*10]]
------------
numpy.dot(k, p)
[[24 27]
 [54 61]]
'''
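
In Python 3.5 and later the @ operator is another spelling of the same matrix product for 2-D arrays. The sketch below also illustrates the shape rule: the number of columns on the left must equal the number of rows on the right. left and right are illustrative names.

```python
import numpy

k = numpy.array([[1, 2], [3, 4]])
p = numpy.array([[6, 7], [9, 10]])
print(numpy.array_equal(k @ p, k.dot(p)))       # True

left = numpy.arange(6).reshape(2, 3)            # shape (2, 3)
right = numpy.arange(6).reshape(3, 2)           # shape (3, 2)
product = left @ right                          # shape (2, 2)
print(product)
# [[10 13]
#  [28 40]]

# (2, 3) @ (2, 3) is not defined: 3 columns do not match 2 rows.
try:
    left @ left
    matmul_failed = False
except ValueError as error:
    matmul_failed = True
    print('ValueError:', error)
```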

2.7.3 Exponential function (powers of e)

import numpy as np
B = np.arange(3)
print(B)
# [0 1 2]
print(np.exp(B))        # e raised to the power of each element
# [1.         2.71828183 7.3890561 ]

2.7.4 Square root

import numpy as np
B = np.arange(3)
print(B)
# [0 1 2]
print(np.sqrt(B))        # square root of each element
# [0.         1.         1.41421356]

2.7.5 floor: rounding down

import numpy as np
ap = np.floor(10*np.random.random((3, 4)))
print(ap)
'''
[[0. 1. 9. 4.]
 [0. 1. 9. 3.]
 [9. 1. 8. 8.]]
'''

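3 Advanced NumPy usage

3.1 Flattening a matrix into a vector

Use the ravel() method to flatten the array, then assign to the shape attribute to give it a new shape.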

import numpy as np
ap = np.floor(10*np.random.random((3, 4)))
print(ap.ravel())
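# [0. 1. 9. 4. 0. 1. 9. 3. 9. 1. 8. 8.]
# ravel() flattens the matrix into a vector
ap.shape = (6, 2)
# ap.shape = (6, -1) gives the same result: once the number of rows is fixed, the number of columns follows, and -1 asks NumPy to compute it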
print(ap)
# [[0. 1.]
#  [9. 4.]
#  [0. 1.]
#  [9. 3.]
#  [9. 1.]
#  [8. 8.]]
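a = np.arange(18).reshape(3, 6)     # the 3 x 6 matrix from section 2.6.1
print(a.T)
# transpose: rows and columns are swapped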
'''
[[ 0  6 12]
 [ 1  7 13]
 [ 2  8 14]
 [ 3  9 15]
 [ 4 10 16]
 [ 5 11 17]]
'''
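
ravel() has a close relative, flatten(). The sketch below illustrates the difference for a contiguous array like the ones used here: ravel() returns a view onto the same data, while flatten() always returns an independent copy. The names are illustrative.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)
raveled = matrix.ravel()        # a view for a contiguous array
flattened = matrix.flatten()    # always a copy

print(np.shares_memory(matrix, raveled))        # True
print(np.shares_memory(matrix, flattened))      # False

# Writing through the view changes the matrix; the copy is unaffected.
raveled[0] = 99
print(matrix[0, 0])     # 99
print(flattened[0])     # 0
```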

3.2 Concatenating matrices

3.2.1 Horizontal concatenation

import numpy as np
h1 = np.floor(10*np.random.random((2, 2)))
h2 = np.floor(10*np.random.random((2, 2)))
print(h1, h2, np.hstack((h1, h2)))      # note: the arrays are passed together as one tuple
'''
[[5. 4.]            [[1. 9.]            [[5. 4. 1. 9.]
 [4. 9.]]            [6. 9.]]            [4. 9. 6. 9.]]
'''

3.2.2 Vertical concatenation

import numpy as np
h1 = np.floor(10*np.random.random((2, 2)))
h2 = np.floor(10*np.random.random((2, 2)))
print(h1, h2, np.vstack((h1, h2)))
'''
[[8. 6.]            [[7. 1.]            [[8. 6.]
 [6. 2.]]            [3. 9.]]            [6. 2.]
                                         [7. 1.]
                                         [3. 9.]]
'''
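
For 2-D arrays, hstack and vstack are shorthand for np.concatenate along axis 1 and axis 0 respectively. The sketch below uses fixed values instead of random ones so that the result can be checked.

```python
import numpy as np

h1 = np.array([[5, 4], [4, 9]])
h2 = np.array([[1, 9], [6, 9]])

side_by_side = np.concatenate((h1, h2), axis=1)     # same as np.hstack for 2-D input
stacked = np.concatenate((h1, h2), axis=0)          # same as np.vstack for 2-D input

print(side_by_side.shape)       # (2, 4)
print(stacked.shape)            # (4, 2)
print(np.array_equal(side_by_side, np.hstack((h1, h2))))    # True
print(np.array_equal(stacked, np.vstack((h1, h2))))         # True
```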

3.3 Splitting matrices

3.3.1 Horizontal split

Equal split

import numpy as np
j = np.floor(10*np.random.random((2, 12)))
# split the 12 columns into 3 equal parts
print(j, np.hsplit(j, 3))
'''
[[7. 6. 3. 7. 1. 9. 1. 7. 6. 9. 0. 4.]
 [2. 3. 3. 5. 7. 0. 6. 8. 4. 6. 7. 0.]]
 ----------------------------------------
 [array([[7., 6., 3., 7.],[2., 3., 3., 5.]]), 
  array([[1., 9., 1., 7.],[7., 0., 6., 8.]]), 
  array([[6., 9., 0., 4.],[4., 6., 7., 0.]])]
'''

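Split at given positions

Passing a tuple of indices cuts before each index: (3, 4) yields columns [0:3], [3:4] and [4:].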

import numpy as np
j = np.floor(10*np.random.random((2, 12)))
print(j, np.hsplit(j, (3, 4)))
'''
[[7. 6. 3. 7. 1. 9. 1. 7. 6. 9. 0. 4.]
 [2. 3. 3. 5. 7. 0. 6. 8. 4. 6. 7. 0.]]
 ----------------------------------------
[array([[7., 6., 3.],[2., 3., 3.]]), 
 array([[7.],[5.]]), 
 array([[1., 9., 1., 7., 6., 9., 0., 4.],[7., 0., 6., 8., 4., 6., 7., 0.]])] 
 '''

3.3.2 Vertical split

Equal split

import numpy as np
j = np.floor(10*np.random.random((2, 12)))
print(j.T, np.vsplit(j.T, 3))
'''
[[6. 3.]
 [9. 0.]
 [5. 2.]
 [4. 8.]
 [8. 2.]
 [1. 9.]
 [4. 0.]
 [6. 2.]
 [8. 3.]
 [5. 7.]
 [0. 7.]
 [2. 1.]]
----------------------------
 [array([[6., 3.],
       [9., 0.],
       [5., 2.],
       [4., 8.]]), 
  array([[8., 2.],
       [1., 9.],
       [4., 0.],
       [6., 2.]]), 
  array([[8., 3.],
       [5., 7.],
       [0., 7.],
       [2., 1.]])]
'''

Split at given positions

import numpy as np
j = np.floor(10*np.random.random((2, 12)))
print(j.T, np.vsplit(j.T, (3, 4)))
'''
[[6. 3.]
 [9. 0.]
 [5. 2.]
 [4. 8.]
 [8. 2.]
 [1. 9.]
 [4. 0.]
 [6. 2.]
 [8. 3.]
 [5. 7.]
 [0. 7.]
 [2. 1.]]
----------------------------
 [array([[6., 3.],
       [9., 0.],
       [5., 2.]]), 
  array([[4., 8.]]), 
  array([[8., 2.],
       [1., 9.],
       [4., 0.],
       [6., 2.],
       [8., 3.],
       [5., 7.],
       [0., 7.],
       [2., 1.]])]
'''
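
The equal splits above work because 12 is divisible by 3. When the size is not divisible, hsplit and vsplit raise an error; np.array_split is a more tolerant alternative that lets the first parts be one element larger. A sketch with a 2 x 10 matrix:

```python
import numpy as np

j = np.arange(20).reshape(2, 10)

# 10 columns cannot be divided into 3 equal parts.
try:
    np.hsplit(j, 3)
    hsplit_failed = False
except ValueError as error:
    hsplit_failed = True
    print('ValueError:', error)

# array_split accepts the uneven division.
parts = np.array_split(j, 3, axis=1)
print([part.shape for part in parts])       # [(2, 4), (2, 3), (2, 3)]
```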

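3.4 Copying matrices

3.4.1 How plain assignment works

import numpy as np
na = np.arange(12)
nb = na             # no copy is made: nb is another name for the same object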
print(nb is na)     # True
print(na.shape)     # (12,)
print(id(na))       # 1946844535904
print(id(nb))       # 1946844535904

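3.4.2 Shallow copy (view)

view() creates a new array object, so the two names refer to different objects, but both share the same underlying data.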

nc = na.view()
print(nc is na)         # False
nc.shape = 2, 6
print(na.shape)         # (12,)
nc[0, 4] = 1234
print(na)               # [   0    1    2    3 1234    5    6    7    8    9   10   11]
print(nc)               # [[   0    1    2    3 1234    5]
                        #  [   6    7    8    9   10   11]]
print(id(na))           # 1217619840656
print(id(nc))           # 1217619839776

3.4.3 Deep copy

nna = np.arange(12).reshape((2, -1))
nd = nna.copy()         # copy() allocates new data, so nd and nna are independent
print(nd is nna)        # False
nd[0, 0] = 9999
print(nna)
'''
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]
'''
print(nd)
'''
[[9999    1    2    3    4    5]
 [   6    7    8    9   10   11]]
'''
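
The three cases can be told apart without comparing id() values. The sketch below uses the base attribute and np.shares_memory; the names alias, view and deep are illustrative.

```python
import numpy as np

na = np.arange(12)
alias = na              # plain assignment: same object
view = na.view()        # shallow copy: new object, shared data
deep = na.copy()        # deep copy: new object, new data

print(alias is na)                      # True
print(view is na, view.base is na)      # False True
print(deep.base is None)                # True
print(np.shares_memory(na, view))       # True
print(np.shares_memory(na, deep))       # False
```

base is None for an array that owns its data, and points to the owning array for a view.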

3.5 Using indices

sre = np.sin(np.arange((20)).reshape(5, 4))
print(sre)
'''
[[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
'''
ind = sre.argmax(axis=0)        # for each column, the row index of its maximum
print(ind)
'''
[2 0 3 1]
'''
sre_max = sre[ind, range(sre.shape[1])]
print(sre_max)
'''
[0.98935825 0.84147098 0.99060736 0.6569866 ]
'''
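
A quick way to check the argmax indexing is to compare it with max(axis=0), which returns the column maxima directly. The sketch below also locates the overall maximum with np.unravel_index, which converts a flat position into a (row, column) pair.

```python
import numpy as np

sre = np.sin(np.arange(20).reshape(5, 4))
ind = sre.argmax(axis=0)

# Pair each column number with the row index of its maximum.
picked = sre[ind, np.arange(sre.shape[1])]
print(np.array_equal(picked, sre.max(axis=0)))      # True

# Without an axis, argmax returns a position in the flattened array.
row, col = np.unravel_index(sre.argmax(), sre.shape)
print(row, col)     # 3 2, the position of sin(14) = 0.99060736
```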

3.6 Extending a matrix by tiling

ss = np.arange(0, 40, 10)
print(ss)
# [ 0 10 20 30]
pp = np.tile(ss, (3, 5))
print(pp)
'''
[[ 0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30  0 10 20 30]]
'''
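
In np.tile(ss, (3, 5)) the tuple gives the number of repetitions along each axis: 3 copies downward and 5 copies across, so the 4 values become a 3 x 20 matrix. The sketch below confirms the shape and contrasts tile with np.repeat, which repeats each element in place instead of repeating the whole block.

```python
import numpy as np

ss = np.arange(0, 40, 10)

print(np.tile(ss, (3, 5)).shape)    # (3, 20)
print(np.tile(ss, 2))               # [ 0 10 20 30  0 10 20 30]
print(np.repeat(ss, 2))             # [ 0  0 10 10 20 20 30 30]
```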

3.7 Sorting and sorting by index

e = np.array([[4, 3, 5], [6, 2, 1]])
print(e)
'''
[[4 3 5]
 [6 2 1]]
'''
k = np.sort(e, axis=1)
print(k)
'''
[[3 4 5]
 [1 2 6]]
'''

po = np.array([4, 3, 1, 2])
ol = np.argsort(po)
print(ol)
# indices that sort the array in ascending order: [2 3 1 0]
print(po[ol])
# the sorted array: [1 2 3 4]
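
argsort is most useful when the order of one array should be applied elsewhere. The sketch below shows two common patterns with the arrays from this section: reversing the indices for descending order, and reordering whole rows of a matrix by the values in one of its columns. The choice of column 1 is arbitrary.

```python
import numpy as np

po = np.array([4, 3, 1, 2])
descending = po[np.argsort(po)[::-1]]       # reverse the ascending order
print(descending)       # [4 3 2 1]

e = np.array([[4, 3, 5], [6, 2, 1]])
order = np.argsort(e[:, 1])                 # sort keys: the second column (3, 2)
print(order)            # [1 0]
print(e[order])         # rows reordered as a whole
# [[6 2 1]
#  [4 3 5]]
```

Unlike np.sort(e, axis=1), which sorts every row independently, indexing with order keeps each row intact.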