[机器学习笔记] 1监督学习

第一章监督学习

1.1 准备工作

如果你是在windows环境下，建议直接使用anaconda，这里里面集成了一些常用的Python库。

如果是在其他环境下，就更方便了，保证这下面几个已经安装就好了。

NumPy： http://docs.scipy.org/doc/numpy-1.10.1/user/install.html
SciPy： http://www.scipy.org/install.html
scikit-learn： http://scikit-learn.org/stable/install.html
matplotlib： http://matplotlib.org/1.4.2/users/installing.html

1.2 数据预处理

其实在机器学习的整个过程中，数据预处理的过程是最麻烦和繁琐的，同样对后面的结果也会产生很大的影响。一定要重视！！！

均值移除

Standardization即标准化，尽量将数据转化为均值为零，方差为一的数据，形如标准正态分布（高斯分布）
scale 零均值单位方差

import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])
data

array([[ 3. , -1.5,  2. , -5.4],
       [ 0. ,  4. , -0.3,  2.1],
       [ 1. ,  3.3, -1.9, -4.3]])

data_standardized = preprocessing.scale(data)

print("Mean = ", data_standardized.mean(axis = 0))
print("Std deviation = ", data_standardized.std(axis = 0))

data_standardized

Mean =  [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Std deviation =  [1. 1. 1. 1.]

array([[ 1.33630621, -1.40451644,  1.29110641, -0.86687558],
       [-1.06904497,  0.84543708, -0.14577008,  1.40111286],
       [-0.26726124,  0.55907936, -1.14533633, -0.53423728]])

范围缩放Scaling

data_scaler = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled = data_scaler.fit_transform(data)
data_scaled

array([[1.        , 0.        , 1.        , 0.        ],
       [0.        , 1.        , 0.41025641, 1.        ],
       [0.33333333, 0.87272727, 0.        , 0.14666667]])

归一化

data_normalized = preprocessing.normalize(data, norm = 'l1')
data_normalized

array([[ 0.25210084, -0.12605042,  0.16806723, -0.45378151],
       [ 0.        ,  0.625     , -0.046875  ,  0.328125  ],
       [ 0.0952381 ,  0.31428571, -0.18095238, -0.40952381]])

二值化

data_binarized = preprocessing.Binarizer(threshold = 2).transform(data)
data_binarized

array([[1., 0., 0., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 0.]])

独热编码（one-hot-encoding）

encoder = preprocessing.OneHotEncoder()
# 给数据进去，根据每列数据得到编码值
encoder.fit([
    [0, 2, 1, 12],
    [1, 3, 5, 3],
    [2, 3, 2, 12],
    [1, 2, 4, 3]
])

encoded_vector = encoder.transform([ [2, 3, 5, 3] ]).toarray()

encoded_vector

array([[0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0.]])

结果分析

encoder.fit 训练数据
第一列[0, 1, 2, 1]得到的3类特征值[0, 1, 2]，独热编码表示为：[100, 010, 001]
第二列[2, 3, 3, 2]得到的2类特征值[2, 3]，独热编码表示为：[10, 01]
第三列[12, 3, 12, 3]得到的2类特征值[3, 12]，独热编码表示为：[10, 01]
当输入[2, 3, 5, 3]时，第一个2就对应[001]，以此类推可得。

1.3 定义一个编码器

import numpy as np
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']

label_encoder.fit(input_classes)

for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

audi --> 0
bmw --> 1
ford --> 2
toyota --> 3

labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print("labels = ", labels)
print("encoded_labels = ", encoded_labels)

labels =  ['toyota', 'ford', 'audi']
encoded_labels =  [3 2 0]

逆向操作，根据数字得到原始的字串

encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("encoded_labels = ", encoded_labels)
print("decoded_labels = ", decoded_labels)

encoded_labels =  [2, 1, 0, 3, 1]
decoded_labels =  ['ford' 'bmw' 'audi' 'toyota' 'bmw']

1.4 创建线性回归器

import sys
import numpy as np

读取文件中数据

X 表示数据
Y 表示标记

filename = "data_singlevar.txt"
X = []
y = []
with open(filename, 'r') as f:
    for line in f.readlines():
        xt, yt = [float(i) for i in line.split(',')]
        X.append(xt)
        y.append(yt)

将数据分为训练数据集、测试数据集

用80%的数据作为训练数据集，20%的数据作为测试数据集

num_training = int(0.8 * len(X))
num_test = len(X) - num_training

# train data
X_train = np.array(X[:num_training]).reshape((num_training, 1))
y_train = np.array(y[:num_training])

# test_data
X_test = np.array(X[num_training:]).reshape((num_test, 1))
y_test = np.array(y[num_training:])

创建回归器对象

from sklearn import linear_model

# 创建线性回归对象
linear_regressor = linear_model.LinearRegression()

# 用训练数据集训练样本
linear_regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

数据拟合

import matplotlib.pyplot as plt

# 在jupyter中直接显示图形
#%matplotlib inline

y_train_pred = linear_regressor.predict(X_train)
plt.figure()
plt.scatter(X_train, y_train, color = 'green')
plt.plot(X_train, y_train_pred, color='red', linewidth=4)
plt.title('Training data')
plt.show()

y_train_pred

array([4.850913  , 2.29390029, 1.16834408, 0.5369345 , 2.43508504,
       1.52130596, 3.05472923, 1.64288172, 3.4273001 , 3.76457478,
       4.06655328, 2.552739  , 2.5566608 , 3.39984751, 3.52534506,
       1.28991984, 4.38421897, 4.54109091, 3.04296383, 4.25087781,
       3.80379277, 3.93321212, 3.32925513, 3.32141154, 3.9881173 ,
       2.63509677, 1.83504985, 3.1292434 , 1.56052395, 3.34102053,
       3.88222874, 0.42320234, 3.63123363, 2.64686217, 1.4114956 ,
       2.11741935, 4.14106745, 3.27434995, 4.49010753, 4.43912415])

将测试数据放入模型进行预测，查看效果

y_test_pred = linear_regressor.predict(X_test)

plt.scatter(X_test, y_test, color = 'green')
plt.plot(X_test, y_test_pred, color = 'red', linewidth = 2)
plt.title('Test Data')

Text(0.5, 1.0, 'Test Data')

1.5 计算回归准确性

准确性评估

平均绝对误差（mean absolute error）:这是给定数据集的所有数据点的绝对误差平均值。
均方误差（mean squared error）：所有数据点的误差的平方的平均值。最常用。
中位数绝对误差（median absolute error）：左右数据点的误差的中位数。
解释方差分（explained variance score）：用来衡量我们的模型对数据集波动的解释能力。得分1.0表示模式是完美的。
R方得分（R2 score）：这个指标读作“R方”，是指确定性相关系数，用于衡量模型对未知的样本预测的效果。最好得分是1.0。

import sklearn.metrics as sm

print("Mean absolute error = ", round(sm.mean_absolute_error(y_test, y_test_pred), 2))

print("Mean squared error = ", round(sm.mean_squared_error(y_test, y_test_pred), 2))

print("Median absolute error = ", round(sm.median_absolute_error(y_test, y_test_pred), 2))

print("Explained variance score = ", round(sm.explained_variance_score(y_test, y_test_pred), 2))

print ("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

Mean absolute error =  0.54
Mean squared error =  0.38
Median absolute error =  0.54
Explained variance score =  0.68
R2 score = 0.68

通常尽量保持均方误差最低，解释方差分最高

1.6 保存模型数据

保存模型数据

import pickle as pickle

output_model_file = 'saved_model.pkl'
with open(output_model_file, 'wb') as f:
    pickle.dump(linear_regressor, f)

加载模型数据

with open(output_model_file, 'rb') as f:
    model_linregr = pickle.load(f)
    
y_test_pred_new = model_linregr.predict(X_test)
print("", round(sm.mean_absolute_error(y_test, y_test_pred_new), 2))

 0.54