scikit-learn Machine Learning (4): Classification with Decision Trees

We use a decision tree to build software that blocks banner ads on web pages.

Given the data describing an image, the task is to decide whether the image is an advertisement or article content.

The data comes from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

It contains data for 3279 images. The classes in this dataset are imbalanced: 459 images are advertisements and the other 2820 are article content.
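To make the imbalance concrete, here is a minimal sketch that computes the class proportions from the counts quoted above. It also computes the accuracy of a hypothetical classifier that labels every image as article content, which suggests why accuracy alone is a weak measure on this dataset. The variable names are illustrative and not part of the original code.

```python
# Class counts quoted in the text
n_ads = 459
n_content = 2820
n_total = n_ads + n_content  # 3279 images in total

# Share of the positive (ad) class
ad_fraction = n_ads / n_total

# Hypothetical baseline: always predict "article content"
baseline_accuracy = n_content / n_total

print('ads: %.1f%%' % (100 * ad_fraction))
print('always-content baseline accuracy: %.1f%%' % (100 * baseline_accuracy))
```

Ads make up roughly 14% of the images, so a model that never predicts an ad would still be right about 86% of the time. Precision, recall and F1 on the ad class are more informative.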

First, load the data and preprocess it.

# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('ad-dataset/ad.data',header=None)

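variable_col = set(df.columns.values) # all column indices
variable_col.remove(len(df.columns.values)-1) # the last column is the label
label_col= df[len(df.columns.values)-1] # extract the label column

y = [1 if e=='ad.' else 0 for e in label_col] # convert labels to numbers: 1 = ad, 0 = not an ad
X = df[list(variable_col)].copy() # use all the remaining columns as X
X.replace(to_replace=r' *\?',value=-1,regex=True,inplace=True) # missing values are marked with '?' (possibly preceded by spaces); replace them with -1
X_train,X_test,y_train,y_test = train_test_split(X,y) # hold out part of the data as a test set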
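The missing-value pattern is easy to get wrong, because `?` is a regex metacharacter. The sketch below uses only Python's `re` module on a few made-up cell values (they are not taken from the dataset) to compare an unescaped pattern with an escaped one. In ` *?` the question mark makes the star lazy, so the pattern matches the empty string and therefore "finds" a match in every cell. In ` *\?` it matches a literal question mark.

```python
import re

# Made-up cell values for illustration; '?' marks a missing value
cells = ['125', '   ?', '?', '0']

# Unescaped: '?' turns ' *' into a lazy quantifier, which matches the empty string
broken = re.compile(' *?')
# Escaped: optional spaces followed by a literal question mark
fixed = re.compile(r' *\?')

broken_hits = [bool(broken.search(c)) for c in cells]
fixed_hits = [bool(fixed.search(c)) for c in cells]

print(broken_hits)  # every cell matches, including valid numbers
print(fixed_hits)   # only the cells that contain '?'
```

With the unescaped pattern every string cell would count as missing, so it is worth checking the pattern on a handful of values before applying it to the whole table.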

Build the decision tree and tune the model with grid search

# In[1] Tune the model with grid search
pipeline = Pipeline([
        ('clf',DecisionTreeClassifier(criterion='entropy'))
        ])
parameters={
        'clf__max_depth':(150,155,160),
        'clf__min_samples_split':(2,3),
        'clf__min_samples_leaf':(1,2,3)
        }
# GridSearchCV systematically tries every parameter combination and uses cross-validation to pick the best-performing one.
grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,verbose=1,scoring='f1')
grid_search.fit(X_train,y_train)

# Retrieve the best parameters found by the search
best_parameters = grid_search.best_estimator_.get_params()
print("Best F1 score:",grid_search.best_score_)
print('Best parameters:')
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name,best_parameters[param_name]))

Best F1 score: 0.8753026365252053
Best parameters:
	clf__max_depth: 160
	clf__min_samples_leaf: 1
	clf__min_samples_split: 3
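To get a feel for how much work the search does, the sketch below enumerates the same parameter grid with `itertools.product`. This is an illustration of the idea only, not how GridSearchCV is implemented internally. The number of cross-validation folds, `cv_folds = 5`, is an assumption: it is the default in recent scikit-learn releases, while older releases used 3.

```python
from itertools import product

# Same grid as in the search above
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

# Build every combination of one value per parameter
keys = sorted(parameters)
combos = [dict(zip(keys, values))
          for values in product(*(parameters[k] for k in keys))]

cv_folds = 5  # assumption: depends on the scikit-learn version and the cv argument
total_fits = len(combos) * cv_folds

print(len(combos), 'combinations,', total_fits, 'cross-validation fits')
```

The grid has 3 × 2 × 3 = 18 combinations, and each one is trained once per fold, so the cost grows quickly as more values are added to any parameter.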

Evaluate the model

# In[2] Predict on the test set and evaluate
predictions = grid_search.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       695
           1       0.93      0.89      0.91       125

   micro avg       0.97      0.97      0.97       820
   macro avg       0.95      0.94      0.94       820
weighted avg       0.97      0.97      0.97       820
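The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values in the report above. Because the inputs are already rounded to two decimals, the result is only an approximation of what scikit-learn computed from the raw counts. The helper name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the report above
f1_content = f1(0.98, 0.99)  # class 0: article content
f1_ad = f1(0.93, 0.89)       # class 1: ad

print('class 0: %.2f' % f1_content)
print('class 1: %.2f' % f1_ad)
```

Both values agree with the report after rounding (0.98 and 0.91). The ad class scores lower mainly because its recall is 0.89, meaning about one ad in nine in the test set was not detected.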