机器学习100天-day1数据预处理

机器学习一百天学习笔记(只作为学习笔记,具体推荐学习如下链接)

原github:
https://github.com/Avik-Jain/100-Days-Of-ML-Code

中文版地址:

https://github.com/MLEveryday/100-Days-Of-ML-Code

参考:作者做了笔记,以及添加了一些相关知识,作为主要参考

https://blog.csdn.net/ssswill/article/details/86022051

不知道博客园怎么上传文件,不然可以直接把代码上传上来,比较方便

day1数据预处理

在这里插入图片描述

一,导入库

import numpy as np
import pandas as pd

二,导入数据集

#导入数据集
dataset = pd.read_csv('D:\100DaysdatasetsData.csv')
#print(dataset)
X = dataset.iloc[ : , :-1].values
print(X)
#提取列数据,前面的:表示提取所有行,后面的为切片,提取1到倒数第二列
Y = dataset.iloc[ : , 3].values
print(Y)
View Code

  loc是显式索引,iloc是隐式索引(左闭右开),

  .values把dataframe转为np数组,X,Y为

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

  把df转为np二维数组

三,处理丢失数据

  方法一:按照原网站方法

#处理丢失数据
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN",strategy="mean",axis=0)
imputer = imputer.fit(X[ : ,1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])

   处理之后的结果

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

 关于sklearn处理缺失值

在这里插入图片描述

如果读入数据时不加values,即为以dataframe的形式处理,也可以Imouter,如下:

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
# imputer = imputer.fit(X.iloc[ : , 1:3])
# X.iloc[ : , 1:3] = imputer.transform(X.iloc[ : , 1:3])
X.iloc[ : , 1:3] = imputer.fit_transform(X.iloc[ : , 1:3])
print(X)
   Country        Age        Salary
0   France  44.000000  72000.000000
1    Spain  27.000000  48000.000000
2  Germany  30.000000  54000.000000
3    Spain  38.000000  61000.000000
4  Germany  40.000000  63777.777778
5   France  35.000000  58000.000000
6    Spain  38.777778  52000.000000
7   France  48.000000  79000.000000
8  Germany  50.000000  83000.000000
9   France  37.000000  67000.000000

  方法二:

利用pandas的fillna函数,一直以来用的也是这个方法

df['Age'].fillna(df['Age'].mean(),inplace=True)

四,解析分类数据

 主要是LabelEncoder和OneHotEncoder

在这里插入图片描述

pandas的独热编码,get_dummies:

 在这里插入图片描述

详细链接:https://blog.csdn.net/lujiandong1/article/details/52836051

代码:

#解析分类数据(labelencoder)
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : ,0] = labelencoder_X.fit_transform(X[ : , 0])

#创建虚拟变量(onehotencoder)
onehotencoder = OneHotEncoder(categorical_features= [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

输出(print(X)):

[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]

这里一开始没有明白为什么列数变了,其实前面的n列都是独热编码的内容,编码成几位就有几列,如例中就是被编码成了100,001,010,001等

做到这基本上对数据预处理有两种途径,一种是用numpy形式处理,一种是用dataframe形式处理,以前一直也用的pandas的df,感觉目前进行的操作来说df比较方便快捷

五,拆分数据集为训练集和测试集

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
print(X_train)
print(Y_train)

六,特征量化

(特征标准化?)

#特征量化
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.fit_transform(X_test)

原文地址:https://www.cnblogs.com/1113127139aaa/p/10271356.html