Task 4: Modeling and Hyperparameter Tuning Check-in

Datawhale Introduction to Data Mining from Scratch - Task 4: Modeling and Hyperparameter Tuning
4. Modeling and Hyperparameter Tuning

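4.1 Learning objectives
Understand the commonly used machine learning models, and master the workflow for building and tuning them
Complete the corresponding check-in task
4.2 Contents
Linear regression models:
  What linear regression requires of its features;
  Handling long-tailed distributions;
  Understanding the linear regression model;
Model performance validation:
  Evaluation functions and objective functions;
  Cross-validation;
  Leave-one-out validation;
  Validation for time-series problems;
  Plotting learning curves;
  Plotting validation curves;
Embedded feature selection:
  Lasso regression;
  Ridge regression;
  Decision trees;
Model comparison:
  Common linear models;
  Common non-linear models;
Hyperparameter tuning:
  Greedy search;
  Grid search;
  Bayesian optimization;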
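4.3 Background theory and recommended reading
Because the theory behind these algorithms takes considerable space to explain, this article instead recommends some blog posts and textbooks for beginners.

4.3.1 Linear regression
https://zhuanlan.zhihu.com/p/49480391

4.3.2 Decision trees
https://zhuanlan.zhihu.com/p/65304798

4.3.3 GBDT
https://zhuanlan.zhihu.com/p/45145899

4.3.4 XGBoost
https://zhuanlan.zhihu.com/p/86816771

4.3.5 LightGBM
https://zhuanlan.zhihu.com/p/89360721

4.3.6 Recommended books:
《机器学习》 (Machine Learning) https://book.douban.com/subject/26708119/
《统计学习方法》 (Statistical Learning Methods) https://book.douban.com/subject/10590856/
《Python大战机器学习》 (Python and Machine Learning) https://book.douban.com/subject/26987890/
《面向机器学习的特征工程》 (Feature Engineering for Machine Learning) https://book.douban.com/subject/26826639/
《数据科学家访谈录》 (Interviews with Data Scientists) https://book.douban.com/subject/30129410/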

4.4 Code examples
4.4.1 Loading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
The reduce_mem_usage function reduces the memory a DataFrame occupies by downcasting each column to a smaller data type

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats; note that float16 keeps only about 3 significant digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns become pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
Memory usage of dataframe is 57.70 MB

Memory usage after optimization is: 15.00 MB

Decreased by 74.0%
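To make the downcasting idea concrete, here is a minimal sketch on a toy DataFrame. The column names and the helper `smallest_int_dtype` are hypothetical and exist only for this illustration; it handles integer columns only and is not a replacement for reduce_mem_usage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),           # values 0..99 fit in int8
    "big_int": np.arange(100, dtype=np.int64) * 100_000,   # values up to 9,900,000 need int32
})

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype whose range contains the series."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= series.min() and series.max() <= info.max:
            return dtype
    return np.int64

before = df.memory_usage(index=False).sum()   # bytes used by the two int64 columns
for col in df.columns:
    df[col] = df[col].astype(smallest_int_dtype(df[col]))
after = df.memory_usage(index=False).sum()    # bytes after downcasting

print(df.dtypes.to_dict())
print(f"{before} bytes -> {after} bytes")
```

In this toy case the two int64 columns take 1600 bytes, and after downcasting to int8 and int32 they take 500 bytes, without changing any stored value.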

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
4.4.2 Linear regression & five-fold cross-validation & simulating a real business scenario

# Drop rows with missing values and treat the '-' placeholder as 0
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']
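4.4.2 - 1 Simple modeling

from sklearn.linear_model import LinearRegression

# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# the outputs below were produced with a version that still accepts it.
model = LinearRegression(normalize=True)

model = model.fit(train_X, train_y)
Inspect the intercept and the weights (coef) of the trained linear regression model

# Not the last expression in the cell, so a notebook does not display it
'intercept:'+ str(model.intercept_)

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)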
[('v_6', 3342612.384537345),
 ('v_8', 684205.534533214),
 ('v_9', 178967.94192530424),
 ('v_7', 35223.07319016895),
 ('v_5', 21917.550249749802),
 ('v_3', 12782.03250792227),
 ('v_12', 11654.925634146672),
 ('v_13', 9884.194615297649),
 ('v_11', 5519.182176035517),
 ('v_10', 3765.6101415594258),
 ('gearbox', 900.3205339198406),
 ('fuelType', 353.5206495542567),
 ('bodyType', 186.51797317460046),
 ('city', 45.17354204168846),
 ('power', 31.163045441455335),
 ('brand_price_median', 0.535967111869784),
 ('brand_price_std', 0.4346788365040235),
 ('brand_amount', 0.15308295553300566),
 ('brand_price_max', 0.003891831020467389),
 ('seller', -1.2684613466262817e-06),
 ('offerType', -4.759058356285095e-06),
 ('brand_price_sum', -2.2430642281682917e-05),
 ('name', -0.00042591632723759166),
 ('used_time', -0.012574429533889028),
 ('brand_price_average', -0.414105722833381),
 ('brand_price_min', -2.3163823428971835),
 ('train', -5.392535065078232),
 ('power_bin', -59.24591853031839),
 ('v_14', -233.1604256172217),
 ('kilometer', -372.96600915402496),
 ('notRepairedDamage', -449.29703564695365),
 ('v_0', -1490.6790578168238),
 ('v_4', -14219.648899108111),
 ('v_2', -16528.55239086934),
 ('v_1', -42869.43976200439)]
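When reading a sorted coefficient list like this one, it helps to remember that the size of a coefficient depends on the unit of its feature, so the ranking is not a ranking of importance. The following is a minimal sketch on synthetic data (the names `toy_model`, `x0` and `x1` are hypothetical, and the model is a plain LinearRegression without any normalization) showing what `intercept_` and `coef_` contain and how rescaling one feature changes its coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Noise-free synthetic target: y = 3 + 2 * x0 - 1 * x1
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)
feature_names = ["x0", "x1"]
# Same sorting idiom as in the tutorial: largest coefficient first
ranked = sorted(zip(feature_names, toy_model.coef_), key=lambda kv: kv[1], reverse=True)
print("intercept:", toy_model.intercept_)
print(ranked)

# Express x0 in a unit that is 1000 times smaller: same information, different scale
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0
rescaled_model = LinearRegression().fit(X_rescaled, y)
print(rescaled_model.coef_)
```

The first fit recovers the intercept 3 and the weights 2 and -1. After multiplying x0 by 1000, its coefficient shrinks by the same factor while the predictions stay the same.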

from matplotlib import pyplot as plt

# Randomly pick 50 row positions (sampled with replacement) for plotting
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
Plot feature v_9 against the label as a scatter plot. The figure shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and that some predictions are even below 0, which indicates that the model has problems

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
The predicted price is obviously different from the true price
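Negative predictions for a strictly positive target are a general property of fitting a straight line to a strongly skewed label, not something specific to this data set. A minimal sketch, assuming a synthetic one-feature data set whose target grows exponentially (all names here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One synthetic feature on [0, 5] and a strictly positive, convex target
x = np.linspace(0, 5, 200).reshape(-1, 1)
y = np.exp(x).ravel()

line = LinearRegression().fit(x, y)
raw_pred = line.predict(x)

print("smallest true value:     ", y.min())
print("smallest predicted value:", raw_pred.min())
print("number of negative predictions:", int((raw_pred < 0).sum()))
```

Every true value is at least 1, yet the least-squares line dips below zero at the low end of the feature range, because it has to compromise between the flat part and the steep part of the curve.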

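The plots show that the label (price) follows a long-tailed distribution, which is unfavorable for modeling and prediction. The reason is that many models assume the error term is normally distributed, and long-tailed data violates this assumption. Reference blog post: https://blog.csdn.net/Noob_daniel/article/details/76087829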

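import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# sns.distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is its successor
sns.distplot(train_y)
plt.subplot(1,2,2)
# Same plot, restricted to prices below the 90th percentile
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])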
It is clear to see the price shows a typical exponential distribution
<matplotlib.axes._subplots.AxesSubplot at 0x1b33efb2f98>

Here we apply a $\log(x+1)$ transform to the label so that it is closer to a normal distribution

train_y_ln = np.log(train_y + 1)

print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
# Same plot, restricted to values below the 90th percentile
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price seems like normal distribution
<matplotlib.axes._subplots.AxesSubplot at 0x1b33f077160>
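The visual impression from the two pairs of histograms can also be checked with a number. The sketch below computes sample skewness before and after the log(x + 1) transform. It uses synthetic log-normal values as a stand-in for the price column, which is an assumption made purely for illustration, and the helper `skewness` is hypothetical.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Hypothetical stand-in for a price column: log-normal, hence a long right tail
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

print("skewness of raw values:   ", skewness(prices))
print("skewness after log(x + 1):", skewness(np.log1p(prices)))
```

A large positive skewness signals a long right tail. After the transform the value is close to zero, which matches what the second pair of plots shows for the real label.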

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:23.515920686637713
[('v_9', 6.043993029165403),
 ('v_12', 2.0357439855551394),
 ('v_11', 1.3607608712255672),
 ('v_1', 1.3079816298861897),
 ('v_13', 1.0788833838535354),
 ('v_3', 0.9895814429387444),
 ('gearbox', 0.009170812023421397),
 ('fuelType', 0.006447089787635784),
 ('bodyType', 0.004815242907679581),
 ('power_bin', 0.003151801949447194),
 ('power', 0.0012550361843629999),
 ('train', 0.0001429273782925814),
 ('brand_price_min', 2.0721302299502698e-05),
 ('brand_price_average', 5.308179717783439e-06),
 ('brand_amount', 2.8308531339942507e-06),
 ('brand_price_max', 6.764442596115763e-07),
 ('offerType', 1.6765966392995324e-10),
 ('seller', 9.308109838457312e-12),
 ('brand_price_sum', -1.3473184925468486e-10),
 ('name', -7.11403461065247e-08),
 ('brand_price_median', -1.7608143661053008e-06),
 ('brand_price_std', -2.7899058266986454e-06),
 ('used_time', -5.6142735899344175e-06),
 ('city', -0.0024992974087053223),
 ('v_14', -0.012754139659375262),
 ('kilometer', -0.013999175312751872),
 ('v_0', -0.04553774829634237),
 ('notRepairedDamage', -0.273686961116076),
 ('v_7', -0.7455902679730504),
 ('v_4', -0.9281349233755761),
 ('v_2', -1.2781892166433606),
 ('v_5', -1.5458846136756323),
 ('v_10', -1.8059217242413748),
 ('v_8', -42.611729973490604),
 ('v_6', -241.30992120503035)]
Visualizing again, the predictions are now fairly close to the true values, and no anomalies appear
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 is the exact inverse of log(x + 1)
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()
The predicted price seems normal after np.log transforming
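Because the label was transformed with log(x + 1), mapping predictions back to prices requires the exact inverse, exp(z) - 1. A short sketch with hypothetical price values shows the round trip using NumPy's `log1p` and `expm1`, and the constant offset that appears when a plain `exp` is used instead:

```python
import numpy as np

prices = np.array([0.0, 1.0, 99.0, 9999.0])  # hypothetical price values

log_prices = np.log1p(prices)      # same as np.log(prices + 1)
restored = np.expm1(log_prices)    # same as np.exp(log_prices) - 1
off_by_one = np.exp(log_prices)    # plain exp forgets the "- 1"

print(restored)
print(off_by_one - prices)
```

`restored` matches the original values, while the plain `exp` version is too large by exactly 1 for every entry. For prices in the thousands this is barely visible on a plot, but it is a systematic bias.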

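4.4.2 - 2 Five-fold cross-validation
When fitting a model's parameters, people usually split the full data set into three parts (the MNIST handwritten-digit data set is a typical example): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to keep the training result trustworthy. The test set is the easiest to understand: it takes no part in training at all and is used only to measure the final performance. The training and validation sets involve the ideas below.

In practice, a trained model usually fits the training set quite well (the result is sensitive to initial conditions), but fits data outside the training set less satisfactorily. So we usually do not train on all of the data. Instead we hold out a part that takes no part in training and use it to test the parameters learned from the rest, which gives a relatively objective estimate of how well those parameters fit unseen data. This idea is called cross-validation. In k-fold cross-validation the data is split into k parts and the procedure is repeated k times, so that each part serves as the held-out set exactly once.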
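The splitting itself can be sketched with scikit-learn's `KFold`. The toy array below is hypothetical; the point is only to show which rows are used for training and which are held out in each of the five rounds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample lands in the held-out part
validation_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X), start=1):
    validation_counts[valid_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")

print(validation_counts)
```

Each sample is held out exactly once and used for training in the other four rounds, so every row contributes to both fitting and evaluation without ever being scored by a model that saw it.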

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
def log_transfer(func):
    # Wrap a metric so that it is evaluated on log-transformed values
    def wrapper(y, yhat):
        # log of a negative prediction is nan; nan_to_num replaces nan with 0
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.1s finished
Five-fold cross-validation of the linear regression model on the data with the untransformed label (error 1.36)

print('AVG:', np.mean(scores))
AVG: 1.3641908155886227
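The score above is a mean absolute error measured in log space, which is what makes it comparable with a model trained directly on the log-transformed label. A minimal sketch with hypothetical numbers shows what the wrapped metric computes and why `np.nan_to_num` is needed when a prediction is negative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 1000.0, 10000.0])  # hypothetical prices
y_pred = np.array([1000.0, 1000.0, 1000.0])  # hypothetical predictions

# MAE in log space: each error is |log(y) - log(yhat)|, i.e. a ratio, not a difference
log_mae = mean_absolute_error(np.log(y_true), np.log(y_pred))
print(log_mae)

# A negative prediction has no real logarithm: np.log returns nan,
# which np.nan_to_num then replaces with 0.0
with np.errstate(invalid="ignore"):
    patched = np.nan_to_num(np.log(np.array([-5.0])))
print(patched)
```

Two of the three predictions are off by a factor of 10 and one is exact, so the log-space MAE is 2 * ln(10) / 3, about 1.535. Replacing nan with 0 keeps the metric finite, but it silently treats a negative prediction as if it were a price of 1, which is one reason a model fitted on the raw label scores poorly here.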
Five-fold cross-validation of the linear regression model on the data with the log-transformed label (error 0.19)

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.1s finished
print('AVG:', np.mean(scores))
AVG: 0.19382863663604424
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
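A note on the scorer: `make_scorer(mean_absolute_error)` returns the MAE as a positive number, which is convenient for reporting. scikit-learn also ships a built-in scoring string, `'neg_mean_absolute_error'`, which returns the same quantity with the sign flipped so that larger always means better. The sketch below, on synthetic data with hypothetical names, checks that the two agree up to sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Positive MAE per fold, as in the tutorial
custom = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring=make_scorer(mean_absolute_error))
# Built-in scorer: same folds, same metric, negated
builtin = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_absolute_error")

print(custom)
print(-builtin)
```

The sign convention matters once the scorer is handed to a model-selection tool that maximizes the score: for that use the built-in string, or `make_scorer(..., greater_is_better=False)`, is the safer choice.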
