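The numpy module

numpy is a library dedicated to array (matrix) computation.

Given two lists, treated as vectors (arrays), how do we multiply their elements pairwise to get [4, 10, 18]?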

lis1 = [1, 2, 3]  
lis2 = [4, 5, 6]

With a for loop, the approach is easy to come up with:

lis = []
for i in range(len(lis1)):
    lis.append(lis1[i] * lis2[i])
print(lis)		# [4, 10, 18]

With numpy, however, a single expression does the job:

import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1*arr2)	# [ 4 10 18]

Pretty neat, isn't it?

numpy can of course do much more than this; it covers most array operations.

So how are numpy arrays represented?

arr = np.array([1, 2, 3])
print(arr)  # a one-dimensional numpy array
Output:
[1 2 3]

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)  # a two-dimensional numpy array (the most common case)
Output:
[[1 2 3]
 [4 5 6]]

arr3 = np.array([[[1, 2, 3], [4, 5, 6]],[[1, 2, 3], [4, 5, 6]],[[1, 2, 3], [4, 5, 6]]])
print(arr3)
Output:
[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]

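numpy handles arrays of three or more dimensions in the same way, although deep-learning workloads on such arrays are usually done with tensorflow/pytorch.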

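numpy array attributes

T transpose of the array (swaps the axes of arrays with two or more dimensions)
dtype data type of the array elements
size total number of elements
ndim number of dimensions
shape size of each dimension (as a tuple)
astype type conversion (a method that returns a new array)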
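The attributes listed above can be read off a single small array. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

shape = arr.shape                  # (2, 3): 2 rows, 3 columns
ndim = arr.ndim                    # 2 dimensions
size = arr.size                    # 6 elements in total
transposed_shape = arr.T.shape     # (3, 2): rows and columns swapped
as_float = arr.astype(np.float64)  # a new array; arr itself is unchanged

print(shape, ndim, size, transposed_shape, as_float.dtype)
```

Note that astype is called with parentheses because it is a method, while the others are plain attributes.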

Transposing an array

Transposing an array simply swaps its rows and columns.

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)  # a two-dimensional numpy array (the most common case)
print(arr2.T)  # rows and columns swapped

Comparison of the results:

[[1 2 3]
 [4 5 6]]

[[1 4]
 [2 5]
 [3 6]]

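Element data type and type conversion

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2.dtype)  # the numpy data type of the elements
print(arr2.astype(np.float64).dtype)

Output: int32 and float64 (the default integer type depends on the platform; on many systems it is int64)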

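Number of elements and shape

The number of elements is the number of rows times the number of columns, and the shape says how many rows and columns there are.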

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2.size)	# 6
print(arr2.shape)	# (2, 3)

Number of dimensions

An n-dimensional array has ndim equal to n.

print(arr2.ndim)	# 2

Slicing

Slicing works as it does for strings; for multi-dimensional arrays, separate the per-axis slices with commas.

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
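print(arr2)  # a two-dimensional numpy array (the most common case)
print(arr2[:, :])  # same result

There are other ways to slice as well:

print(arr2[0:1, :])  # the first row and all columns, kept as a 2-D array
print(arr2[0:1, 0:1])  # first row, first column: a 2-D array holding one element
print(arr2[0, :])  # the first row, all columns
print(arr2[0, 0], type(arr2[0, 0]))  # the element at first row, first column, and its type
print(arr2[0, [0, 2]])  # the first row, elements in the first and third columns
print(arr2[0, 0] + 1)  # take the first-row, first-column element and add 1

Output: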

[[1 2 3]]
[[1]]
[1 2 3]
1 <class 'numpy.int32'>
[1 3]
2

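Modifying values

Modifying a numpy array is much like modifying a list.

arr2 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr2)  # a two-dimensional numpy array (the most common case)
arr2[0, :] = 0  # set every element of the first row to zero
print(arr2)
arr2[1, 1] = 1  # set the element at second row, second column to 1
print(arr2)
arr2[arr2 < 3] = 3  # boolean indexing (set every value below 3 to 3)
print(arr2)

Output: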

[[1 2 3]
[4 5 6]]
[[0 0 0]
[4 5 6]]
[[0 0 0]
[4 1 6]]
[[3 3 3]
[4 3 6]]

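Concatenation

Concatenating numpy arrays returns a new array and leaves the inputs unchanged.

numpy arrays can be joined in two directions: horizontally (side by side) and vertically (one on top of the other).

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type

arr2 = np.array([[7, 8, 9], [10, 11, 12]])  # mutable type


print(np.hstack((arr1,arr2)))  # horizontal: joined side by side, adding columns
print(np.vstack((arr1,arr2)))  # vertical: stacked top to bottom, adding rows

print(np.concatenate((arr1, arr2)))  # axis=0 by default, same as vstack here
print(np.concatenate((arr1, arr2),axis=1))  # axis=1 joins side by side; axis=0 stacks vertically

Output: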

[[ 1 2 3 7 8 9]
[ 4 5 6 10 11 12]]
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 1 2 3 7 8 9]
[ 4 5 6 10 11 12]]
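As a quick check of how the stacking helpers relate to concatenate for 2-D arrays, the sketch below compares their results directly. The variable names are illustrative.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

side_by_side = np.hstack((arr1, arr2))  # shape (2, 6)
stacked = np.vstack((arr1, arr2))       # shape (4, 3)

# For 2-D inputs, hstack matches axis=1 and vstack matches axis=0
same_h = np.array_equal(side_by_side, np.concatenate((arr1, arr2), axis=1))
same_v = np.array_equal(stacked, np.concatenate((arr1, arr2), axis=0))

print(side_by_side.shape, stacked.shape, same_h, same_v)
```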

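Creating numpy arrays with functions

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr1)

print(np.zeros((5, 5)))  # 5 rows and 5 columns, all zeros
print(np.ones((5, 5)) * 100)  # every element becomes 100
print(np.eye(5))  # 5x5 identity matrix

print(np.arange(1,10,2))  # one-dimensional: from 1 up to (not including) 10 in steps of 2
print(np.linspace(0,20,10))  # 10 evenly spaced values from 0 to 20 inclusive, e.g. for x-axis values


arr = np.zeros((2, 2))
print(arr.reshape((1,4)))  # reshape to 1 row and 4 columns (raises an error if the element counts differ)

Output: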

[[1 2 3]
[4 5 6]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[[100. 100. 100. 100. 100.]
[100. 100. 100. 100. 100.]
[100. 100. 100. 100. 100.]
[100. 100. 100. 100. 100.]
[100. 100. 100. 100. 100.]]
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
[1 3 5 7 9]
[ 0. 2.22222222 4.44444444 6.66666667 8.88888889 11.11111111
13.33333333 15.55555556 17.77777778 20. ]
[[0. 0. 0. 0.]]
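The reshape rule mentioned above (the element counts must match) can be seen in a short sketch. The use of -1 to let numpy infer one dimension is an addition not covered in the text above.

```python
import numpy as np

arr = np.arange(6)                   # [0 1 2 3 4 5]
two_by_three = arr.reshape((2, 3))   # 2 * 3 == 6, so this works
auto_cols = arr.reshape((3, -1))     # -1 asks numpy to infer the size: (3, 2)

try:
    arr.reshape((4, 2))              # 4 * 2 != 6
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(two_by_three.shape, auto_cols.shape, reshape_failed)
```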

Array arithmetic

Arrays support the arithmetic operators, all of which act element by element on matching positions.

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr1)
arr2 = np.array([[7, 8, 9], [10, 11, 12]])  # mutable type
print(arr2)

# + - * / // % **
print(arr1*arr2)
print(arr1+arr2)

Output:

[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]
[[ 7 16 27]
[40 55 72]]
[[ 8 10 12]
[14 16 18]]

Math functions

numpy's math functions compute sine, cosine, and so on, element by element.

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr1)

print(np.sin(arr1))
print(np.cos(arr1))

print(np.sqrt(arr1))
print(np.exp(arr1))

Output:

[[1 2 3]
[4 5 6]]
[[ 0.84147098 0.90929743 0.14112001]
[-0.7568025 -0.95892427 -0.2794155 ]]
[[ 0.54030231 -0.41614684 -0.9899925 ]
[-0.65364362 0.28366219 0.96017029]]
[[1. 1.41421356 1.73205081]
[2. 2.23606798 2.44948974]]
[[ 2.71828183 7.3890561 20.08553692]
[ 54.59815003 148.4131591 403.42879349]]

Additional notes (for reference)

arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr1)
arr2 = np.array([[7, 8, 9], [10, 11, 12]])  # mutable type
print(arr2)

print(arr1.T)
print(arr1.transpose())
# m*n × n*m = m*m
print(np.dot(arr1,arr2.T))

# Matrix inverse
arr1 = np.array([[1, 2, 3], [4, 5, 6], [9, 8, 9]])
print(np.linalg.inv(arr1))
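One way to sanity-check an inverse is to multiply it back and compare with the identity matrix. The sketch below does that for the same 3x3 matrix, and also shows what happens for a singular matrix; the singular example is an assumption added for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [9, 8, 9]], dtype=float)
a_inv = np.linalg.inv(a)

# a @ a_inv should equal the identity up to floating-point rounding
is_identity = np.allclose(a @ a_inv, np.eye(3))

# The second row is twice the first, so this matrix has no inverse
singular = np.array([[1, 2], [2, 4]], dtype=float)
try:
    np.linalg.inv(singular)
    raised = False
except np.linalg.LinAlgError:
    raised = True

print(is_identity, raised)
```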


# numpy statistical methods (for reference)
arr1 = np.array([[1, 2, 3], [4, 5, 6]])  # mutable type
print(arr1)
print(arr1.var())  # variance
print(arr1.std())  # standard deviation
print(arr1.mean())  # mean
print(arr1.cumsum())  # cumulative sum


# numpy random numbers (for reference)

print(np.random.rand(3,4))

print(np.random.randint(1,10,(3,4)))  # integers from 1 (inclusive) to 10 (exclusive), shape 3*4

print(np.random.choice([1,2,3,4,5],3))

print(arr1)
np.random.shuffle(arr1)
print(arr1)
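Random output changes on every run unless the generator is seeded. A minimal sketch using np.random.seed, which the plotting examples later in these notes also rely on:

```python
import numpy as np

np.random.seed(1)
first = np.random.rand(3)
np.random.seed(1)               # re-seeding restarts the same sequence
second = np.random.rand(3)
same = np.array_equal(first, second)

ints = np.random.randint(1, 10, (3, 4))  # values from 1 to 9; 10 is excluded

print(same, ints.min() >= 1, ints.max() <= 9)
```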

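The pandas module

pandas is mostly used to process excel/csv files; for excel files it acts as a wrapper around numpy plus the xlrd module.

pandas is built on numpy and is designed for text and tabular data. It has two main data structures: Series, which resembles a one-dimensional numpy array, and DataFrame, which resembles a multi-dimensional table.

pandas is the core module for data analysis in python. It provides five main capabilities:

File input and output, covering databases (sql), html, json, pickle, csv (txt, excel), sas, stata, hdf, and more.
Single-table operations such as insert, delete, update, query, slicing, higher-order functions, and group-by aggregation, plus conversion to and from dict and list.
Joining and merging multiple tables.
Simple plotting.
Simple statistical analysis.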

import pandas as pd
import numpy as np

# Series (one-dimensional; rarely used on its own these days)
df = pd.Series(np.array([1,2,3,4]))
print(df)


# DataFrame (multi-dimensional)

dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)

values = np.random.rand(6, 4) * 10
print(values)

columns = ['c1','c2','c3','c3']  # note: 'c3' appears twice; pandas allows duplicate column labels

df = pd.DataFrame(values,index=dates,columns=columns)
print(df)

Output:

0 1
1 2
2 3
3 4
dtype: int32
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
'2019-05-31', '2019-06-30'],
dtype='datetime64[ns]', freq='M')
[[0.13809596 4.43757186 8.15340748 0.14350627]
[5.33396321 1.40130474 5.38309504 8.76508077]
[9.64657086 6.46941267 5.43993509 6.98395337]
[1.94830252 3.81777031 8.96334566 0.1907509 ]
[0.48171212 9.84128458 3.78832666 0.45958052]
[4.77158079 2.09453098 9.26988158 1.08968117]]
c1 c2 c3 c3
2019-01-31 0.138096 4.437572 8.153407 0.143506
2019-02-28 5.333963 1.401305 5.383095 8.765081
2019-03-31 9.646571 6.469413 5.439935 6.983953
2019-04-30 1.948303 3.817770 8.963346 0.190751
2019-05-31 0.481712 9.841285 3.788327 0.459581
2019-06-30 4.771581 2.094531 9.269882 1.089681

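pandas attributes and methods

dtypes data type of each column
index row labels (the index)
columns column labels
values the data in the frame, without the header or index
describe per-column summary (extremes, mean, median); numeric data only
transpose transpose; T also works
sort_index sort by the row or column index
sort_values sort by data values

Use them in the form table_name.function; they need little further explanation here.

The Series data structure

A Series is an object similar to a one-dimensional array, made up of a set of data and an associated set of data labels (the index).

A Series is like a cross between a list (array) and a dictionary.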

import numpy as np
import pandas as pd
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)

Output:

a 0
b 0
c 0
d 0
dtype: int64

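numpy-style features supported by Series

Description	Usage
Create a Series from an ndarray	Series(arr)
Arithmetic with a scalar	df*2
Arithmetic between two Series	df1+df2
Indexing	df[0], df[[1,2,4]]
Slicing	df[0:2]
Universal functions	np.abs(df)
Boolean filtering	df[df>0]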
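The table rows above map onto code roughly as follows. This is a small sketch with made-up values; the Series keeps its default integer index so that the list of labels [1, 2, 4] is valid.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.array([-2, -1, 0, 1, 2]))  # Series from an ndarray

doubled = sr * 2          # arithmetic with a scalar
summed = sr + sr          # arithmetic between two Series
picked = sr[[1, 2, 4]]    # several labels at once
head = sr[0:2]            # slicing
absolute = np.abs(sr)     # a numpy universal function
positive = sr[sr > 0]     # boolean filtering

print(picked.tolist(), head.tolist(), positive.tolist())
```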

Dictionary-style features supported by Series

Description	Usage
Create a Series from a dict	Series(dic)
in operator	'a' in sr
Key indexing	sr['a'], sr[['a', 'b', 'd']]
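A short sketch of the dictionary-like behaviour listed above, with illustrative keys and values. Note that the in operator tests the index labels, not the stored values.

```python
import pandas as pd

sr = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})  # Series from a dict

has_a = 'a' in sr       # True: 'a' is an index label
has_one = 1 in sr       # False: values are not searched, only labels
one = sr['a']           # key indexing
several = sr[['a', 'b', 'd']]

print(has_a, has_one, one, several.tolist())
```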

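Handling missing data

Method	Description
dropna()	drop the entries whose value is NaN
fillna()	fill in missing values
isnull()	return a boolean array that is True where a value is missing
notnull()	return a boolean array that is False where a value is missing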
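The four methods in the table can be tried on a tiny Series. This is a sketch with invented data; none of the calls modify sr in place.

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0, np.nan])

dropped = sr.dropna()        # only the non-missing entries remain
filled = sr.fillna(0)        # NaN replaced by 0
mask_null = sr.isnull()      # True where the value is missing
mask_notnull = sr.notnull()  # the opposite mask

print(dropped.tolist(), filled.tolist())
```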

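The DataFrame data structure

date_range generates a sequence of timestamps

Parameter	Description
start	start time
end	end time
periods	number of timestamps to generate
freq	frequency, 'D' (day) by default; other options include H (hour), W (week), B (business day), SM (semi-month), M (month end), T (minute), S (second), A (year end), ... (recent pandas releases rename several of these aliases, for example 'ME' instead of 'M')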
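To make the parameters concrete, the sketch below combines them in three ways. The dates are chosen for illustration; only the 'D', 'W', and 'B' aliases are used because their spelling has stayed stable across pandas versions.

```python
import pandas as pd

# start + periods: three consecutive days
by_periods = pd.date_range(start='2019-01-01', periods=3, freq='D')

# start + end: every Sunday that falls inside the range ('W' is weekly, anchored on Sunday)
by_end = pd.date_range(start='2019-01-01', end='2019-01-10', freq='W')

# business days skip the weekend: Friday, then Monday and Tuesday
business = pd.date_range(start='2019-01-04', periods=3, freq='B')

print(list(by_periods.strftime('%Y-%m-%d')))
print(list(by_end.strftime('%Y-%m-%d')))
print(list(business.strftime('%Y-%m-%d')))
```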

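np.random.seed(1)  # fixed seed, matching the values printed below
dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)  # the last day of each month

values = np.random.randn(6, 4) * 10  # standard-normal samples scaled by 10
print(values)

columns = ['c1', 'c2', 'c3', 'c4']

df = pd.DataFrame(values, index=dates, columns=columns)
print(df)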

c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

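DataFrame attributes and methods

Name	Description
dtypes	data type of each column
index	row labels (the index)
columns	column labels
values	the data in the frame, without the header or index
describe	per-column summary (extremes, mean, median); numeric data only
transpose	transpose; T also works
sort_index	sort by the row or column index
sort_values	sort by data values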

# Check the data type of each column
print(df.dtypes)
c1    float64
c2    float64
c3    float64
c4    float64
dtype: object
df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

print(df.index)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
print(df.columns)
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values)
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
df.describe()

	c1	c2	c3	c4
count	6.000000	6.000000	6.000000	6.000000
mean	2.022213	-5.466424	7.927203	-6.514830
std	9.580084	11.107772	8.707171	10.227641
min	-11.006192	-23.015387	-5.281718	-20.601407
25%	-2.849200	-8.113329	2.570580	-10.931606
50%	0.733054	-4.979054	10.176801	-9.170878
75%	7.288155	-2.830414	13.800233	1.865690
max	16.243454	11.447237	17.448118	5.828152

df.T

	2019-01-31 00:00:00	2019-02-28 00:00:00	2019-03-31 00:00:00	2019-04-30 00:00:00	2019-05-31 00:00:00	2019-06-30 00:00:00
c1	16.243454	8.654076	3.190391	-3.224172	-1.724282	-11.006192
c2	-6.117564	-23.015387	-2.493704	-3.840544	-8.778584	11.447237
c3	-5.281718	17.448118	14.621079	11.337694	0.422137	9.015907
c4	-10.729686	-7.612069	-20.601407	-10.998913	5.828152	5.024943

# Sort by the row labels (the dates), in ascending order by default
df.sort_index(axis=0)

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Sort by the column labels [c1, c2, c3, c4], in ascending order by default
df.sort_index(axis=1)

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Sort by the values in column c2, in ascending order by default
df.sort_values(by='c2')

	c1	c2	c3	c4
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-06-30	-11.006192	11.447237	9.015907	5.024943
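Both sorting methods sort in ascending order unless told otherwise. A small sketch with an invented three-row frame shows the ascending parameter reversing the order:

```python
import pandas as pd

df_demo = pd.DataFrame({'c1': [3, 1, 2], 'c2': [-6.1, -23.0, 11.4]},
                       index=['x', 'y', 'z'])

ascending = df_demo.sort_values(by='c2')                     # smallest c2 first
descending = df_demo.sort_values(by='c2', ascending=False)   # largest c2 first
cols_reversed = df_demo.sort_index(axis=1, ascending=False)  # column labels in reverse

print(ascending.index.tolist(), descending.index.tolist(), cols_reversed.columns.tolist())
```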

Selecting values from a DataFrame

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

Selecting by column

df['c2']
2019-01-31    -6.117564
2019-02-28   -23.015387
2019-03-31    -2.493704
2019-04-30    -3.840544
2019-05-31    -8.778584
2019-06-30    11.447237
Freq: M, Name: c2, dtype: float64
df[['c2', 'c3']]

	c2	c3
2019-01-31	-6.117564	-5.281718
2019-02-28	-23.015387	17.448118
2019-03-31	-2.493704	14.621079
2019-04-30	-3.840544	11.337694
2019-05-31	-8.778584	0.422137
2019-06-30	11.447237	9.015907

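loc (selecting by row label)

# Select rows by label. The index only holds month-end dates, so no label
# falls between 2019-01-01 and 2019-01-03 and the result is empty
df.loc['2019-01-01':'2019-01-03']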

	c1	c2	c3	c4

df[0:3]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

iloc (positional indexing, as with numpy arrays)

df.values
array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],
       [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],
       [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],
       [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],
       [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],
       [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])
# Select data by integer position
print(df.iloc[2, 1])
-2.493703754774101
df.iloc[1:4, 1:4]

	c2	c3	c4
2019-02-28	-23.015387	17.448118	-7.612069
2019-03-31	-2.493704	14.621079	-20.601407
2019-04-30	-3.840544	11.337694	-10.998913

Selecting with boolean conditions

df[df['c1'] > 0]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

df[(df['c1'] > 0) & (df['c2'] > -8)]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

Replacing values in a DataFrame

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

df.iloc[0:3, 0:2] = 0
df

	c1	c2	c3	c4
2019-01-31	0.000000	0.000000	-5.281718	-10.729686
2019-02-28	0.000000	0.000000	17.448118	-7.612069
2019-03-31	0.000000	0.000000	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

df['c3'] > 10
2019-01-31    False
2019-02-28     True
2019-03-31     True
2019-04-30     True
2019-05-31    False
2019-06-30    False
Freq: M, Name: c3, dtype: bool
# Operates on whole rows: every row where c3 > 10 is set to 100
df[df['c3'] > 10] = 100
df

	c1	c2	c3	c4
2019-01-31	0.000000	0.000000	-5.281718	-10.729686
2019-02-28	100.000000	100.000000	100.000000	100.000000
2019-03-31	100.000000	100.000000	100.000000	100.000000
2019-04-30	100.000000	100.000000	100.000000	100.000000
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Again operates on whole rows: every row where c3 equals 100 is set to 1000
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000
df

	c1	c2	c3	c4
2019-01-31	0	0	-5	-10
2019-02-28	1000	1000	1000	1000
2019-03-31	1000	1000	1000	1000
2019-04-30	1000	1000	1000	1000
2019-05-31	-1	-8	0	5
2019-06-30	-11	11	9	5

Reading CSV files

import pandas as pd
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''

test_data = StringIO(test_data)
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
df

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN
6	NaN	NaN	NaN	NaN

Handling missing data

df.isnull()

	c1	c2	c3	c4
0	False	True	False	False
1	False	False	False	False
2	False	False	True	False
3	False	False	False	False
4	False	False	False	False
5	False	False	False	True
6	True	True	True	True

# Calling sum() after isnull() gives the number of missing values in each column
print(df.isnull().sum())
c1    1
c2    2
c3    2
c4    2
dtype: int64
# axis=0 drops the rows that contain NaN
df.dropna(axis=0)

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5

# axis=1 drops the columns that contain NaN
df.dropna(axis=1)


0
1
2
3
4
5
6

# Drop the rows in which every value is NaN
df.dropna(how='all')

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

# Keep only the rows that have at least 4 non-NaN values
df.dropna(thresh=4)

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5

# Drop the rows that have NaN in column c2
df.dropna(subset=['c2'])

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

# Fill NaN values
df.fillna(value=10)

	c1	c2	c3	c4
0	5.1	10.0	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	10.0	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	10.0
6	10.0	10.0	10.0	10.0

Combining data

df1 = pd.DataFrame(np.zeros((3, 4)))
df1

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0

df2 = pd.DataFrame(np.ones((3, 4)))
df2

	0	1	2	3
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

# axis=0 stacks the frames vertically (appends rows)
pd.concat((df1, df2), axis=0)

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

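# axis=1 places the frames side by side (appends columns)
pd.concat((df1, df2), axis=1)

	0	1	2	3	0	1	2	3
0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0
1	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0
2	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0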

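# append can only add rows; it was deprecated in pandas 1.4 and removed in 2.0,
# where pd.concat((df1, df2)) gives the same result
df1.append(df2)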

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0
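The row labels 0, 1, 2 repeat after stacking because each frame keeps its own index. A minimal sketch, assuming a fresh 0..n-1 index is wanted, uses the ignore_index parameter of pd.concat:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)))
df2 = pd.DataFrame(np.ones((3, 4)))

rows = pd.concat([df1, df2], axis=0, ignore_index=True)  # 6 rows, relabelled 0..5
cols = pd.concat([df1, df2], axis=1)                     # 3 rows, 8 columns

print(rows.index.tolist(), cols.shape)
```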

Importing and exporting data

Use df = pd.read_excel(filename) to read a file and df.to_excel(filename) to save one.

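Reading data from files

Main parameters of the file-reading functions:

Parameter	Description
sep	delimiter; may be a regular expression such as '\s+'
header=None	the file has no header row
names	column names to use
index_col	column to use as the index
skiprows	rows to skip
na_values	strings to treat as missing values
parse_dates	whether to parse columns as dates; a boolean or a list of columns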

df = pd.read_excel(filename)
df = pd.read_csv(filename)
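The sketch below exercises most of the read parameters on an in-memory string instead of a real file. The sample text, the column names, and the 'missing' marker are all invented for illustration.

```python
from io import StringIO
import pandas as pd

raw = (
    "exported by a hypothetical tool\n"   # a junk first line to skip
    "1;5.1;2019-01-31\n"
    "2;missing;2019-02-28\n"
)

df_read = pd.read_csv(
    StringIO(raw),
    sep=';',                        # fields are separated by semicolons
    header=None,                    # the data has no header row
    names=['id', 'score', 'day'],   # so the column names are supplied here
    index_col='id',                 # use the id column as the index
    skiprows=1,                     # skip the junk first line
    na_values=['missing'],          # treat this string as NaN
    parse_dates=['day'],            # parse the day column as dates
)

print(df_read)
```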

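Writing data to files

Main parameters of the file-writing functions:

Parameter	Description
sep	delimiter
na_rep	string used to represent missing values; an empty string by default
header=False	do not write the column names
index=False	do not write the row index
columns	columns to write, given as a list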

df.to_excel(filename)
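The write parameters can be tried without touching the disk by writing CSV text into a buffer. This is a sketch using to_csv rather than to_excel, with invented data, so that the result is easy to inspect.

```python
from io import StringIO
import numpy as np
import pandas as pd

df_out = pd.DataFrame({'c1': [5.1, np.nan], 'c2': [3.0, 3.2], 'c3': [1.4, 1.4]})

buffer = StringIO()
df_out.to_csv(
    buffer,
    sep=';',               # semicolon-separated output
    na_rep='NA',           # write NaN as the string NA
    index=False,           # leave out the row index
    columns=['c1', 'c2'],  # write only these two columns
)
lines = buffer.getvalue().splitlines()

print(lines)
```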

Reading JSON with pandas

# Triple quotes are needed because the string literal spans several lines
strtext = '''[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'''

df = pd.read_json(strtext, orient='records')
df

	code	code1	code2	issue	time	ttery
0	8,4,5,2,9	297734529	NaN	20130801-3391	1013395466000	min
1	7,8,2,1,2	298058212	NaN	20130801-3390	1013395406000	min
2	5,9,1,2,9	298329129	NaN	20130801-3389	1013395346000	min
3	3,8,7,3,3	298588733	NaN	20130801-3388	1013395286000	min
4	0,8,5,2,7	298818527	NaN	20130801-3387	1013395226000	min

df.to_excel('pandas处理json.xlsx',
            index=False,
            columns=["ttery", "issue", "code", "code1", "code2", "time"])

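The five forms of the orient parameter

orient describes the expected format of the JSON string. It takes one of the following five values:

1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

This format consists of an index, the column names, and a data matrix. The keys must be named index, columns, and data.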

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
df

	a	b
1	1	3
2	2	8
3	3	9

2.'records' : list like [{column -> value}, ... , {column -> value}]

This is a list whose members are dicts, as in the JSON sample processed earlier. Each dict maps column names to values and becomes one row of the DataFrame.

# Triple quotes are needed because the string literal spans several lines
strtext = '''[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'''

df = pd.read_json(strtext, orient='records')
df

	code	code1	code2	issue	time	ttery
0	8,4,5,2,9	297734529	NaN	20130801-3391	1013395466000	min
1	7,8,2,1,2	298058212	NaN	20130801-3390	1013395406000	min

3.'index' : dict like {index -> {column -> value}}

The index labels are the keys, and each value is a dict mapping column names to values. For example:

s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
df

	a	b
0	1	2
1	9	11

4.'columns' : dict like {column -> {index -> value}}

Here the column names are the keys, and each value is a dict that maps index labels to values. For example:

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
df

	a	b
0	1	2
1	9	11

5.'values' : just the values array.

This form is the most familiar: a nested list whose members are themselves lists, two levels deep.

s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
df

	0	1
0	a	1
1	b	2
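Going the other way helps to see the five layouts side by side. The sketch below serialises one small frame with each orient value through to_json and decodes the text with the standard json module; the frame itself is the one from the 'split' example.

```python
import json
import pandas as pd

df_json = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 8, 9]}, index=[1, 2, 3])

# Serialise with each orient, then decode so the structures are easy to compare
encoded = {
    orient: json.loads(df_json.to_json(orient=orient))
    for orient in ['split', 'records', 'index', 'columns', 'values']
}

for orient, value in encoded.items():
    print(orient, value)
```

JSON object keys are always strings, which is why the integer index labels come back as '1', '2', '3' in the 'index' and 'columns' layouts.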

Reading from SQL with pandas

import numpy as np
import pandas as pd
import pymysql


def conn(sql):
    # Connect to the MySQL database
    connection = pymysql.connect(
        host="localhost",
        port=3306,
        user="root",
        passwd="123",
        db="db1",
    )
    try:
        data = pd.read_sql(sql, con=connection)
        return data
    except Exception as e:
        # Report the underlying error; the function then returns None
        print("SQL is not correct!", e)
    finally:
        connection.close()


sql = "select * from test1 limit 0, 10"  # SQL statement
data = conn(sql)
print(data.columns.tolist())  # list the column names
print(data)  # show the data

matplotlib

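matplotlib is a plotting library that can create the common statistical charts, including bar charts, box plots, line charts, scatter plots, pie charts, and histograms.

Bar chart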

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with white grid lines)
plt.style.use('ggplot')

classes = ['3班', '4班', '5班', '6班']

classes_index = range(len(classes))
print(list(classes_index))
[0, 1, 2, 3]
student_amounts = [66, 55, 45, 70]

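# Figure setup
fig = plt.figure()
# 1,1,1 splits the figure into 1 row and 1 column (one plot) and selects plot 1; 2,2,1 splits it into 2 rows and 2 columns (four plots) and selects the first one (top left)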
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(classes_index, student_amounts, align='center', color='darkblue')
ax1.xaxis.set_ticks_position('bottom')
ax1.yaxis.set_ticks_position('left')

plt.xticks(classes_index,
           classes,
           rotation=0,
           fontsize=13,
           fontproperties=font)
plt.xlabel('班级', fontproperties=font, fontsize=15)
plt.ylabel('学生人数', fontproperties=font, fontsize=15)
plt.title('班级-学生人数', fontproperties=font, fontsize=20)
# Save the figure; bbox_inches='tight' trims the blank margins around the plot
# plt.savefig('classes_students.png', dpi=400, bbox_inches='tight')
plt.show()

Histogram

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with white grid lines)
plt.style.use('ggplot')

mu1, mu2, sigma = 50, 100, 10
# Build normally distributed data with mean 50
x1 = mu1 + sigma * np.random.randn(10000)
print(x1)
[59.00855949 43.16272141 48.77109774 ... 57.94645859 54.70312714
 58.94125528]
# Build normally distributed data with mean 100
x2 = mu2 + sigma * np.random.randn(10000)
print(x2)
[115.19915511  82.09208214 110.88092454 ...  95.0872103  104.21549068
 133.36025251]
fig = plt.figure()
ax1 = fig.add_subplot(121)
# bins=50 splits the value range into 50 intervals, so there are 50 bars
ax1.hist(x1, bins=50, color='darkgreen')

ax2 = fig.add_subplot(122)
ax2.hist(x2, bins=50, color='orange')

fig.suptitle('两个正态分布', fontproperties=font, fontweight='bold', fontsize=15)
ax1.set_title('绿色的正态分布', fontproperties=font)
ax2.set_title('橙色的正态分布', fontproperties=font)
plt.show()

Line chart

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with white grid lines)
plt.style.use('ggplot')

np.random.seed(1)

# Use numpy's cumulative sum so the series drifts like a random walk instead of fluctuating around 0
plot_data1 = randn(40).cumsum()
print(plot_data1)
[ 1.62434536  1.01258895  0.4844172  -0.58855142  0.2768562  -2.02468249
 -0.27987073 -1.04107763 -0.72203853 -0.97140891  0.49069903 -1.56944168
 -1.89185888 -2.27591324 -1.1421438  -2.24203506 -2.41446327 -3.29232169
 -3.25010794 -2.66729273 -3.76791191 -2.6231882  -1.72159748 -1.21910314
 -0.31824719 -1.00197505 -1.12486527 -2.06063471 -2.32852279 -1.79816732
 -2.48982807 -2.8865816  -3.5737543  -4.41895994 -5.09020607 -5.10287067
 -6.22018102 -5.98576532 -4.32596314 -3.58391898]
plot_data2 = randn(40).cumsum()
plot_data3 = randn(40).cumsum()
plot_data4 = randn(40).cumsum()

plt.plot(plot_data1, marker='o', color='red', linestyle='-', label='红实线')
plt.plot(plot_data2, marker='x', color='orange', linestyle='--', label='橙虚线')
plt.plot(plot_data3, marker='*', color='yellow', linestyle='-.', label='黄点线')
plt.plot(plot_data4, marker='s', color='green', linestyle=':', label='绿点图')

# loc='best' lets matplotlib pick the best position for the legend
plt.legend(loc='best', prop=font)
plt.show()

Scatter plot + line plot

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with white grid lines)
plt.style.use('ggplot')

x = np.arange(1, 20, 1)
print(x)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
# Generate noisy points around the line y = x
np.random.seed(1)
y_linear = x + 10 * np.random.randn(19)
print(y_linear)
[ 17.24345364  -4.11756414  -2.28171752  -6.72968622  13.65407629
 -17.01538697  24.44811764   0.38793099  12.19039096   7.50629625
  25.62107937  -8.60140709   9.77582796  10.15945645  26.33769442
   5.00108733  15.27571792   9.22141582  19.42213747]
# Generate noisy points around the curve y = x²
y_quad = x**2 + 10 * np.random.randn(19)
print(y_quad)
[  6.82815214  -7.00619177  20.4472371   25.01590721  30.02494339
  45.00855949  42.16272141  62.77109774  71.64230566  97.3211192
 126.30355467 137.08339248 165.03246473 189.128273   216.54794359
 249.28753869 288.87335401 312.82689651 363.34415698]
# s is the marker size
fig = plt.figure()
ax1 = fig.add_subplot(121)
plt.scatter(x, y_linear, s=30, color='r', label='红点')
plt.scatter(x, y_quad, s=100, color='b', label='蓝点')

ax2 = fig.add_subplot(122)
plt.plot(x, y_linear, color='r')
plt.plot(x, y_quad, color='b')

# Limit the ranges of the x and y axes
plt.xlim(min(x) - 1, max(x) + 1)
plt.ylim(min(y_quad) - 10, max(y_quad) + 10)
fig.suptitle('散点图+直线图', fontproperties=font, fontsize=20)
ax1.set_title('散点图', fontproperties=font)
ax1.legend(prop=font)
ax2.set_title('直线图', fontproperties=font)
plt.show()

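Pie chart

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
# font is used by annotate() and set_title() below; same font path as in the other examples
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')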

fig, ax = plt.subplots(subplot_kw=dict(aspect="equal"))

recipe = ['优', '良', '轻度污染', '中度污染', '重度污染', '严重污染', '缺']

data = [2, 49, 21, 9, 11, 6, 2]
colors = ['lime', 'yellow', 'darkorange', 'red', 'purple', 'maroon', 'grey']
wedges, texts, texts2 = ax.pie(data,
                               wedgeprops=dict(width=0.5),
                               startangle=40,
                               colors=colors,
                               autopct='%1.0f%%',
                               pctdistance=0.8)
plt.setp(texts2, size=14, weight="bold")

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(xycoords='data',
          textcoords='data',
          arrowprops=dict(arrowstyle="->"),
          bbox=None,
          zorder=0,
          va="center")

for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1) / 2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    kw["arrowprops"].update({"connectionstyle": connectionstyle})
    ax.annotate(recipe[i],
                xy=(x, y),
                xytext=(1.25 * np.sign(x), 1.3 * y),
                size=16,
                horizontalalignment=horizontalalignment,
                fontproperties=font,
                **kw)

ax.set_title("饼图示例",fontproperties=font)

plt.show()
# plt.savefig('jiaopie2.png')

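Box plot

A box plot, also called a box-and-whisker plot, is a statistical chart that shows how a set of data is spread out (in data analysis it is often used for outlier detection).

It shows the following for a set of data: maximum, minimum, median, upper quartile (Q3), lower quartile (Q1), and outliers.

Median → the middle value when the sorted data is split into two equal halves
Lower quartile Q1 → the sorted sequence is split into four equal parts; the position is computed as (n+1)/4 or (n-1)/4, and (n+1)/4 is the usual choice
Upper quartile Q3 → the sorted sequence is split into four equal parts; the position is (n+1)/4*3 (for example, 6.75 when n=8)
Inner fences → the T-shaped whiskers mark the inner fences: upper fence Q3+1.5IQR, lower fence Q1-1.5IQR (IQR=Q3-Q1)
Outer fences → upper fence Q3+3IQR, lower fence Q1-3IQR (IQR=Q3-Q1)
Outliers → beyond the inner fences: mild outliers; beyond the outer fences: extreme outliers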
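The fence rules above can be worked through numerically. This is a sketch on invented data; note that np.percentile interpolates linearly by default, which is a different quartile convention from the (n+1)/4 position rule, so quartile values may differ slightly between the two.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 18, 50])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr      # outer fences

outside_inner = (data < lower_inner) | (data > upper_inner)
outside_outer = (data < lower_outer) | (data > upper_outer)

mild = data[outside_inner & ~outside_outer]  # between the inner and outer fences
extreme = data[outside_outer]                # beyond the outer fences

print(q1, q3, iqr, mild, extreme)
```

With these numbers the inner fences are -3.5 and 14.5 and the outer fences are -10.25 and 21.25, so 18 counts as a mild outlier and 50 as an extreme one.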

import numpy as np
import pandas as pd
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
plt.figure(figsize=(10, 4))
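# Create the figure and the data

f = df.boxplot(
    sym='o',  # marker shape for outliers, see marker
    vert=True,  # draw the boxes vertically
    whis=1.5,  # whisker length in multiples of the IQR (default 1.5); a pair such as [5, 95] places the whiskers at the 5th and 95th percentiles
    patch_artist=True,  # fill the box between the quartiles when True
    meanline=False,
    showmeans=True,  # show the mean and how it is drawn
    showbox=True,  # show the box
    showcaps=True,  # show the caps at the ends of the whiskers
    showfliers=True,  # show the outliers
    notch=False,  # whether the box has a notch
    return_type='dict'  # return a dict
)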
plt.title('boxplot')

for box in f['boxes']:
    box.set(color='b', linewidth=1)  # box outline colour
    box.set(facecolor='b', alpha=0.5)  # box fill colour
for whisker in f['whiskers']:
    whisker.set(color='k', linewidth=0.5, linestyle='-')
for cap in f['caps']:
    cap.set(color='gray', linewidth=2)
for median in f['medians']:
    median.set(color='DarkBlue', linewidth=2)
for flier in f['fliers']:
    flier.set(marker='o', color='y', alpha=0.5)
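# boxes: the boxes
# medians: horizontal line at the median
# whiskers: vertical lines from the box to the error bar
# fliers: outliers
# caps: horizontal lines of the error bar
# means: horizontal line at the mean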

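Parameters of the plot function

Line style linestyle (-, -., --, :)
Marker marker (v, ^, s, *, H, +, x, D, o, ...)
Colour color (b, g, r, y, k, w, ...)

Plot annotation functions

Set the plot title: plt.title()
Set the x-axis label: plt.xlabel()
Set the y-axis label: plt.ylabel()
Set the x-axis range: plt.xlim()
Set the y-axis range: plt.ylim()
Set the x-axis ticks: plt.xticks()
Set the y-axis ticks: plt.yticks()
Set the legend: plt.legend()

Applying Matplotlib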

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline

# Find the font path on your own machine and change the path below
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

header_list = ['方程组', '函数', '导数', '微积分', '线性代数', '概率论', '统计学']
py3_df = pd.read_excel('py3.xlsx', header=None,
                       skiprows=[0, 1], names=header_list)
# Drop the rows that contain NaN
py3_df = py3_df.dropna(axis=0)
print(py3_df)

# Custom mapping
map_dict = {
    '不会': 0,
    '了解': 1,
    '熟悉': 2,
    '使用过': 3,
}

for header in header_list:
    py3_df[header] = py3_df[header].map(map_dict)

unable_series = (py3_df == 0).sum(axis=0)
know_series = (py3_df == 1).sum(axis=0)
familiar_series = (py3_df == 2).sum(axis=0)
use_series = (py3_df == 3).sum(axis=0)

# (legend label, per-subject counts, bar colour) for each answer, stacked from bottom to top
levels = [
    ('不会', unable_series, 'r'),
    ('了解', know_series, 'y'),
    ('熟悉', familiar_series, 'g'),
    ('使用过', use_series, 'b'),
]
for i, header in enumerate(header_list):
    bottom = 0
    for label, series, color in levels:
        count = series.iloc[i]  # positional access into the label-indexed Series
        # Only the first subject carries legend labels, so each level appears once in the legend
        plt.bar(x=header, height=count, width=0.60, color=color,
                bottom=bottom, label=label if i == 0 else '')
        if count != 0:
            plt.text(header, bottom, s=count,
                     ha='center', va='bottom', fontsize=15, color='white')
        bottom += count

plt.xticks(header_list, fontproperties=font)
plt.ylabel('人数', fontproperties=font)
plt.title('Python3期数学摸底可视化', fontproperties=font)
plt.legend(prop=font, loc='upper left')
plt.show()
    方程组   函数   导数        微积分       线性代数  概率论  统计学
0   使用过  使用过   不会         不会         不会   不会   不会
1   使用过  使用过   了解         不会         不会   不会   不会
2   使用过  使用过   熟悉         不会         不会   不会   不会
3    熟悉   熟悉   熟悉         了解         了解   了解   了解
4   使用过  使用过  使用过        使用过        使用过  使用过  使用过
5   使用过  使用过  使用过         不会         不会   不会   了解
6    熟悉   熟悉   熟悉         熟悉         熟悉   熟悉   不会
7   使用过  使用过  使用过        使用过        使用过  使用过  使用过
8    熟悉   熟悉   熟悉         熟悉         熟悉  使用过  使用过
9    熟悉   熟悉  使用过         不会        使用过  使用过   不会
10  使用过  使用过   熟悉         熟悉         熟悉   熟悉   熟悉
11  使用过  使用过  使用过        使用过        使用过   不会   不会
12  使用过  使用过  使用过        使用过        使用过  使用过  使用过
13  使用过  使用过   了解         不会         不会   不会   不会
14  使用过  使用过  使用过        使用过        使用过   不会   不会
15  使用过  使用过   熟悉         不会         不会   不会   不会
16   熟悉   熟悉  使用过        使用过        使用过   不会   不会
17  使用过  使用过  使用过         了解         不会   不会   不会
18  使用过  使用过  使用过        使用过         熟悉   熟悉   熟悉
19  使用过  使用过  使用过         了解         不会   不会   不会
20  使用过  使用过  使用过        使用过        使用过  使用过  使用过
21  使用过  使用过  使用过        使用过        使用过  使用过  使用过
22  使用过  很了解   熟悉  了解一点，不会运用  了解一点，不会运用   了解   不会
23  使用过  使用过  使用过        使用过         熟悉  使用过   熟悉
24   熟悉   熟悉   熟悉        使用过         不会   不会   不会
25  使用过  使用过  使用过        使用过        使用过  使用过  使用过
26  使用过  使用过  使用过        使用过        使用过   不会   不会
27  使用过  使用过   不会         不会         不会   不会   不会
28  使用过  使用过  使用过        使用过        使用过  使用过   了解
29  使用过  使用过  使用过        使用过        使用过   了解   不会
30  使用过  使用过  使用过        使用过        使用过   不会   不会
31  使用过  使用过  使用过        使用过         不会  使用过  使用过
32   熟悉   熟悉  使用过        使用过        使用过   不会   不会
33  使用过  使用过  使用过        使用过         熟悉  使用过   熟悉
34   熟悉   熟悉   熟悉        使用过        使用过   熟悉   不会
35  使用过  使用过  使用过        使用过        使用过  使用过  使用过
36  使用过  使用过  使用过        使用过        使用过  使用过   了解
37  使用过  使用过  使用过        使用过        使用过   不会   不会
38  使用过  使用过  使用过         不会         不会   不会   不会
39  使用过  使用过   不会         不会         不会   不会   不会
40  使用过  使用过  使用过        使用过        使用过   不会   不会
41  使用过  使用过   熟悉         了解         了解   了解   不会
42  使用过  使用过  使用过         不会         不会   不会   不会
43   熟悉  使用过   了解         了解         不会   不会   不会
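
The plotting loop above stacks each bar by starting a segment at `bottom` and then adding that segment's height to `bottom`. A minimal sketch of that bookkeeping, using only the first six answers of the 导数 column from the table as sample data; the stacking order in `levels` is an assumption inferred from the unable/know/familiar/use variable names, and `segments` is a hypothetical helper list that is not part of the original script:

```python
from collections import Counter

# First six answers from the 导数 column of the table above (a sample, not the full data).
answers = ["不会", "了解", "熟悉", "熟悉", "使用过", "使用过"]

# Assumed stacking order, matching the unable/know/familiar/use variables in the loop.
levels = ["不会", "了解", "熟悉", "使用过"]

counts = Counter(answers)

bottom = 0
segments = []  # (level, start height, end height) for each stacked segment
for level in levels:
    height = counts[level]
    segments.append((level, bottom, bottom + height))
    bottom += height  # the next segment starts where this one ends

for level, start, end in segments:
    print(level, start, end)
```

Each start value is what would be passed as `bottom=` to `plt.bar`, and the final value of `bottom` equals the total number of answers in the column.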

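typing module

The typing module provides types such as Generator, Iterable, and Iterator (among others), which are used to annotate the parameter types and return type of a function.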

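from typing import Generator, Iterable, Iterator


def ger():
    # A simple generator, passed below as the Generator-annotated argument g
    yield 1


# parameter type annotations                                                          return type
def func(i: int, f: float, b: bool, lt: list, tup: tuple, dic: dict, g: Generator) -> tuple:
    lis = [i, f, b, lt, tup, dic]
    return tuple(lis)

# i, f, b, lt, tup, dic = func(1, 2, 3, 4, 5, 6, 7)  # no runtime error, it just ignores the annotations
res = func(1, 2, True, [1, 2], (1, 2), {'a': 1}, ger())
print(res)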

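So when a value is passed whose type does not match the annotation, no error is raised. The annotations only act as a hint to the reader and to tools.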
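
A minimal sketch of that behaviour, assuming a hypothetical function `add` that is not part of the original example: the annotations are stored on the function and can be inspected, but Python does not check them when the function is called.

```python
from typing import get_type_hints


def add(x: int, y: int) -> int:
    return x + y


# The annotations are recorded on the function object...
hints = get_type_hints(add)
print(hints)

# ...but they are not enforced: str arguments are accepted despite the int annotations.
result = add("a", "b")
print(result)  # ab
```

A static checker or an IDE can use the same annotations to flag the second call before the program runs.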

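In earlier versions, typing was needed for more than these three types: parameterized annotations for containers, for example, had to be written with typing.List or typing.Dict. As Python has evolved, the common built-in data types can be used directly in annotations (since Python 3.9, list, dict, and tuple can also be subscripted, as in list[int]), so typing matters less for everyday annotations.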
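
As a rough illustration of the three types named at the start of this section, the sketch below annotates three small functions with them. The function names `countdown`, `total`, and `first` are hypothetical and chosen only for this example.

```python
from typing import Generator, Iterable, Iterator


def countdown(n: int) -> Generator[int, None, None]:
    # Yields n, n-1, ..., 1; the two None entries are the send type and return type.
    while n > 0:
        yield n
        n -= 1


def total(values: Iterable[int]) -> int:
    # Accepts anything that can be looped over: a list, a tuple, a generator, ...
    return sum(values)


def first(it: Iterator[int]) -> int:
    # Accepts only objects that support next(), such as a generator.
    return next(it)


print(total([1, 2, 3]))      # 6
print(total(countdown(3)))   # 6
print(first(countdown(3)))   # 3
```

Every generator is an Iterator and every Iterator is an Iterable, which is why `countdown(3)` can be passed to both `total` and `first`, while a plain list fits only the Iterable annotation.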

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN
6	NaN	NaN	NaN	NaN

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

	0	1	2	3
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0
