Machine Learning (from zhouxun, a former team lead)

Main:

  • Template.py is the main workflow and implements, in order:
  1. Train Test Split
  2. Missing Imputation
  3. Feature Selection
  4. Cap and Floor
  5. Data Scaling
  6. Model Selection
  7. Feature Reduction
  8. AUC & KS graphing, Model Ranking and PSI
  9. Ranking of the model's predicted probabilities against observed overdue rates
  10. Performance validation after feature reduction
  11. Feature importance plot of the final model (if applicable)
  12. Saving the model as PKL or PMML

Function Files:

  • Data_Processing.py is the data preprocessing module, containing these functions:
  1. Train Test Split
  2. Feature Selection
  3. Cap and Floor
  • Multiple_Model_Selection.py is the model selection module, which implements:
  1. Model selection via grid search with cross-validation (CV) and randomized search with CV, returning the best model after training several model families.
  2. Given that best model, Feature Reduction: cut the number of input variables as far as possible while keeping performance from dropping noticeably below the best model found above, thereby improving the model's robustness, and validate the result.
  • Model_Evaluation.py is the model evaluation module, which implements:
  1. AUC and KS plotting
  2. PSI calculation
  3. Model Ranking
  4. Ranking of the model's predicted probabilities against observed overdue rates
  5. Feature importance plot of the final model (if applicable)

The code as a whole follows a functional programming style and is meant as a framework, so variable handling is deliberately not fine-grained; add whatever processing your actual data structure requires.
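
For example, one data-specific step you may need to add is one-hot encoding of categorical columns before the numeric-only pipeline; Template.py below only hints at this with a pd.get_dummies(df) comment. A minimal sketch, with hypothetical column names:

import pandas as pd

# Hypothetical categorical columns; replace with the ones in your own data
categorical_cols = ['channel', 'product_type']
dataset = pd.get_dummies(dataset, columns = categorical_cols, dummy_na = True)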

I. First, the Data_Processing module:

First, import the required modules:

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit  # mainly for samples with an imbalanced label distribution
from sklearn.feature_selection import VarianceThreshold, SelectFromModel  # the first is variance-threshold feature selection (drop features whose variance falls below a threshold); the second is a form of embedded feature selection
#from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesClassifier  # extremely randomized trees, a variant of random forest
from matplotlib import style, pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # Display Chinese Characters
plt.rcParams['axes.unicode_minus'] = False # Display Minus Sign
style.use('ggplot')

1. Train/test split using StratifiedShuffleSplit (mainly useful when the label distribution is imbalanced); returns the training and test sets with X and y merged back into DataFrames.

train, test = trainTestSplitV2(dataset, 'flag')

def trainTestSplitV2(data, response, testsize = 0.3, trainsize = 0.7, rdm_state = None):
    '''
    Train test split with tolerance of the mean difference between
    dataset and test set.
    
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data (i.e. the y label).
    
    testsize: numeric, between 0.0 and 1.0, the size of the testing set.
    
    trainsize: numeric, between 0.0 and 1.0, the size of the training set.
    
    rdm_state: None or int, random state of the stratified split.
    -----------
    '''
    
    X = np.array(data.drop(response, axis = 1))  # convert to an array because many models cannot take a DataFrame directly
    y = np.array(data[response])
    
    sssplit = StratifiedShuffleSplit(n_splits = 1, test_size = testsize, 
                                     train_size = trainsize, random_state = rdm_state)  # a single 70/30 split, so no for loop is needed
    
    # Generate indices to split data into training and test set.
    split_index = sssplit.split(X, y)
    train_index, test_index = next(split_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Get all the columns name back
    train_data = pd.DataFrame(data = np.c_[X_train, y_train],   # np.c_ concatenates the two matrices column-wise, keeping the number of rows unchanged
                              columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))  # .tolist() converts the array to a list
    test_data = pd.DataFrame(data = np.c_[X_test, y_test], 
                             columns = np.array((data.drop(response, axis = 1).columns.tolist() + [response])))
 
    print('Mean of y_all is: {:.4f} \nMean of y_train is: {:.4f} \nMean of y_test is: {:.4f}'.format(
        data[response].mean(), train_data[response].mean(), test_data[response].mean()))

    return(train_data, test_data)

2. Compute each feature's variance and use VarianceThreshold to keep only features with non-zero variance; the function returns the zero-variance features (which are dropped afterwards).

elmi_col = varThreshold(dataset, train)

#-------------------------Variance Threshold Function-------------------------
def varThreshold(data, trainset, thd = 0):
    '''
    Feature selector that removes all low-variance features.
    
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response.

    trainset: nparray, training features set.
    
    thd: numeric, threshold of the variance.
    -----------
    '''
    
    sel_def = VarianceThreshold(threshold = thd)
    new_train = sel_def.fit_transform(trainset)
    print('The Number of Features Selected When Removing All Zero-variance Features:', new_train.shape[1])  # shape[0] is the number of rows, shape[1] the number of columns
    
    # Get names of low-variance features
    bool_arr = (sel_def.variances_ == thd).tolist()  # returns a list of booleans, True where the variance equals the threshold
    seq = [i for i, value in enumerate(bool_arr) if value]  # indices of the True entries
    eliminate_feature = [data.columns.tolist()[ele] for ele in seq]
    
    return(eliminate_feature)
['身份证_姓名命中法院结案模糊名单',
 '身份证命中信贷逾期名单',
 '手机号命中信贷逾期名单',
 '第一联系人手机号命中信贷逾期名单',
 '身份证命中法院失信名单',
 '身份证_姓名命中法院失信模糊名单',
 '身份证命中公司欠税名单',
 'X1month_第三方服务商',
 'X1month_理财机构',
 'X1month_银行小微贷款',
 'X1month_汽车租赁',
 'X1month_房地产金融',
 'X1month_融资租赁',
 'X3month_银行小微贷款',
 'X3month_房地产金融',
 'X7days_互联网金融门户',
 'X7days_第三方服务商',
 'X7days_理财机构',
 'X7days_财产保险',
 'X7days_银行小微贷款',
 'X7days_汽车租赁',
 'X7days_房地产金融',
 'X7days_融资租赁']

3. Outlier handling (values above or below the quantile thresholds are replaced with the thresholds, i.e. cap and floor) and data scaling (min-max); returns each feature's quantile limits and the scaled data.

Note: step 4 is actually run before step 3, i.e. the 50 features are selected first and their -1 placeholder values are converted back to NaN, and only then is the operation below applied (the actual call order is sketched below).

quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)

The return value is the capped/floored and scaled feature set.
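
For reference, a minimal sketch of the call order actually used later in Template.py (section IV): feature selection first, then the -1 placeholder converted back to NaN so scaling ignores missing values, then cap/floor and scaling:

imp_col = featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features')
train = train[(imp_col + ['flag'])]
test = test[(imp_col + ['flag'])]

# Turn the -1 missing-value placeholder back into NaN so it is ignored while scaling
train.replace(-1, np.nan, inplace = True)
test.replace(-1, np.nan, inplace = True)

quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)

# Fill NaN back with -1 afterwards
train.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
test.replace([np.nan, np.inf, -np.inf], -1, inplace = True)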

#-------------------------Apply Quantile(Low and High) of Data to Replace Outliers-------------------------
def replaceOutlierNScale(data_a, data_b, response, low = 0.02, high = 0.98, 
                         interpolation = 'midpoint', scale = True):
    '''
    Replace low and high quantile outliers of data_b with specified quantile value of data_a.
    Only deal with numeric columns, ignore the Nan value and string type columns.
    
    Parameter:
    -----------
    data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.
    
    response: string, name of response column in data.
        
    low: low limit.
    
    high: high limit.
    
    interpolation: This parameter specifies the interpolation method to use.  
    -----------
    '''
    
    # Replace Outliers: Cap and floor 
    new_df = data_b.drop(response, axis = 1)
    
    quantile_limit = data_a.drop(response, axis = 1).quantile([low, high], interpolation = interpolation)  # a DataFrame indexed by [low, high]
    '''
    e.g.:
               0    1    2    3
    0.02     0.0  1.0  0.5  3.0
    0.98     0.0  1.5  2.5  3.0
    '''
    
    outliers_low = (new_df < quantile_limit.loc[low, :])  # a boolean DataFrame
    '''
             0      1      2      3
    0    False  False   True  False
    1    False  False  False  False
    2    False  False  False  False
    '''

    new_df.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)  # replace values below the floor with the floor; mask replaces where the condition is True
    
    outliers_high = (new_df > quantile_limit.loc[high, :])
    new_df.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
    
    new_df[response] = data_b[response]
    
    # Min-max scale
    if scale:
#        # Replace Outliers: Cap and floor 
#        X_a = data_a.drop(response, axis = 1)
#        outliers_low = (X_a < quantile_limit.loc[low, :])
#        X_a.mask(outliers_low, quantile_limit.loc[low, :], inplace = True, axis = 1)  
#        outliers_high = (X_a > quantile_limit.loc[high, :])
#        X_a.mask(outliers_high, quantile_limit.loc[high, :], inplace = True, axis = 1)
        
#        X_a = np.array(X_a)
#        X_b = np.array(new_df.drop(response, axis = 1))
        
#        # Fit on X_a and transform to X_b 
#        min_max_scaler = MinMaxScaler().fit(X_a)
#        X_b = min_max_scaler.transform(X_b)
        
#        # Get all the columns name back
#        new_df = pd.DataFrame(data = np.c_[X_b, np.array(data_b[response])], 
#                              columns = np.array((data_b.drop(response, axis = 1).columns.tolist() + [response])))
    
        new_df.drop(response, axis = 1, inplace = True)
        new_df = (new_df - quantile_limit.min()) / (quantile_limit.max() - quantile_limit.min())  # min-max scale using the quantile limits
        
        new_df[response] = data_b[response] 
    
    return(quantile_limit, new_df)   
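
A tiny usage illustration with made-up numbers (assumed toy data; scale is turned off so only the cap-and-floor step is shown):

toy = pd.DataFrame({'x': [-50.0, 1.0, 2.0, 3.0, 500.0],
                    'flag': [0, 0, 1, 0, 1]})
limits, capped = replaceOutlierNScale(toy, toy, 'flag', low = 0.02, high = 0.98, scale = False)
# Values of x below its 2% quantile are raised to that quantile,
# and values above its 98% quantile are lowered to it.
print(limits)
print(capped)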
    

4. Feature selection (featureSelectFromModel), based on feature_importances_; returns the names of the selected features.

imp_col = featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', 
                                n_tree = 500, n_core = -1, rdm_state = None, thd = 'median', 
                                word = 1, show = 10, figsize = (12, 10))

(As noted above, step 4 is run before step 3.)

#-------------------------Feature Selected From Model-------------------------
def featureSelectFromModel(trainset, response, figname, 
                           n_tree = 500, n_core = -1, 
                           rdm_state = None, word = 0, thd = 'median', 
                           show = 10, figsize = (10, 8)):
    '''
    Feature selector that removes all unimportant features.
    
    Parameters:
    -----------
    trainset : pandas DataFrame with all possible predictors and response in train set.
    
    response: string, name of response column in data.
    
    figname: string, name of the feature importance graph.
    
    rdm_state: None or int, random state of feature selection model.
    
    thd: string, set as 'median', 'mean' or e.g. '0.25*median', '0.25*mean';
    features whose importance is greater than or equal to thd are kept 
    while the others are discarded.
        
    show: int, the number of features to be shown in the graph.
    
    figsize: tuple, the size of figure.
    -----------
    '''
    
    column_names = trainset.drop(response, axis = 1).columns.tolist()
    X = np.array(trainset.drop(response, axis = 1))
    y = np.array(trainset[response])
    # Create the SelectFromModel object and retrieve the optimal number of features which
    # the threshold value is set as thd of the feature importances.
    clf = ExtraTreesClassifier(n_estimators = n_tree, n_jobs = n_core, 
                               random_state = rdm_state, verbose = word)
    clf = clf.fit(X, y)  
    model = SelectFromModel(clf, threshold = thd, prefit = True)
    X_new = model.transform(X)
    print('The Number of Features Selected:', X_new.shape[1]) 
    
    # Get the feature importances
    importances = clf.feature_importances_  # an array with one importance value per feature
    
    #*************** Obtain names of important features ***************
    def ImptFeature(thd = thd, impt = importances):
        method = thd.split('*')
        if 'median' in method:
            name_val = pd.Series(impt, index = column_names)  # the index holds the column names, so the return value is a list of column names
            if len(method) == 1:  # length is 1 when thd is just 'median'
                imp_val = name_val[name_val >= np.median(impt)]  # keep features whose importance is >= the median importance
                return(imp_val.index.tolist())
                
            elif len(method) == 2:
                coef = float(method[0])
                imp_val = name_val[name_val >= (coef * np.median(impt))]
                return(imp_val.index.tolist())
                
            else:
                print('"thd" may not follow the expected pattern, check it !!!')
                
        elif 'mean' in method:
            name_val = pd.Series(impt, index = column_names)
            if len(method) == 1:
                imp_val = name_val[name_val >= np.mean(impt)]
                return(imp_val.index.tolist())
                
            elif len(method) == 2:
                coef = float(method[0])
                imp_val = name_val[name_val >= (coef * np.mean(impt))]
                return(imp_val.index.tolist())
                
            else:
                print('"thd" may not follow the expected pattern, check it !!!')
                
        else:
            print('"thd" may not follow the expected pattern, check it !!!')
    #*********************************************************

    # Standard deviation of feature importances
    std = np.std([tree.feature_importances_ for tree in clf.estimators_],
                 axis=0)
    
    # Return the indices of the top several important features 
    indices = np.argsort(importances)[::-1][:show]  # show = 10 keeps the 10 most important features
    
    # Get the top important features name
    features = [column_names[ele] for ele in indices]    
    
    #***************Graph of feature importance***************
    # Show importance of each feature   
    fig = plt.figure(figsize = figsize)
    axes = plt.subplot2grid((1,1), (0,0))
    axes.bar(range(show), importances[indices],
           color = '#4682B4', yerr = std[indices], align = 'center')
    plt.xticks(range(show), features)
    plt.xlim((-1, show))
    
    # Rotate the angle of the labels
    for label in axes.xaxis.get_ticklabels():
        label.set_rotation(90)
        
    plt.title(('Top ' + str(show) + ' Important Features'))
    plt.xlabel('Name of Variable')
    plt.ylabel('Importance')   
    plt.savefig((figname + '.jpg'))
    #*********************************************************
    
    # Output the top important features
    print('Top ' + str(show) + ' Features Ranking:')
    for f in range(show): 
        print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(f + 1,
              indices[f], column_names[indices[f]], importances[indices[f]]))
        
    return(ImptFeature())

The 50 selected features are returned:

['loan_amount',
 'loan_term',
 'final_score',
 'X3个月内申请人在多个平台申请借款',
 'X1个月内申请人在多个平台申请借款',
 'X1month_P2P网贷',
 'X1month_财产保险',
 'X3month_P2P网贷',
 'X3month大型消费金融公司',
 'X3month_互联网金融门户',
 'X3month_信用卡中心',
 'label',
 'cell_number_xiangguan',
 'risk_count',
 'annual_income_500000',
 'X_y',
 '本品牌合作时间',
 '经营年限_注册时间_',
 '上级经销商法人手机',
 '总和_经营面积',
 '总和_经营年限',
 '总和_年销售额',
 '总和_年销售总量',
 'order_sum',
 'first_dd_sum',
 'first_dd_avg',
 'first_dd_sd',
 'max_dd_sum',
 'max_dd_avg',
 'max_dd_sd',
 'pass_term_avg',
 'pass_term_sd',
 'fpd_sum',
 'loan_amt_sum',
 'loan_amt_avg',
 'loan_amt_sd',
 'fpd_times',
 'times',
 'loan_times',
 'xs_amount',
 'xy_amount',
 'history',
 'layer',
 'type',
 'size',
 'wang2_count_x',
 'wang2_def30_count',
 'wang2_c_order_sum',
 'wang2_count_y',
 'amount_sum']

II. Model selection: two search strategies, with the option to output the best model after training several model families

model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining,
                               iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)
Best score of RandomizedSearchCV is: 0.7967
Best parameters of RandomizedSearchCV is: 
 {'kernel': 'linear', 'gamma': 'auto', 'C': 100.0}
Best score of RandomizedSearchCV is: 0.8623
Best parameters of RandomizedSearchCV is: 
 {'max_depth': None, 'max_features': 0.3, 'min_samples_leaf': 6, 'min_samples_split': 8, 'n_estimators': 600}
Best score of RandomizedSearchCV is: 0.8514
Best parameters of RandomizedSearchCV is: 
 {'colsample_bylevel': 0.7, 'colsample_bytree': 0.7, 'gamma': 0.15000000000000002, 'learning_rate': 0.05, 'max_delta_step': 6, 'max_depth': 1, 'min_child_weight': 2, 'n_estimators': 300, 'random_state': 457, 'reg_alpha': 2, 'reg_lambda': 10, 'subsample': 0.7}
# Get best CV estimator
best_model = bestModel(model_set)
# The output is the model with the highest score, i.e. the random forest

===== Now for feature reduction: as noted above, 50 features enter the model, so they need to be pruned =====

One puzzling thing happened: the run above clearly gave the random forest the highest score, yet the feature-reduction step below uses the XGBoost result, so I removed the random forest entry and only then ran the reduction below (one possible way to do this is sketched after this paragraph).
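
A minimal sketch (my assumption, not part of the original code) of one way to drop the random forest result so that the XGBoost search result becomes the best model passed to featureReduce:

# Drop the RandomForest entry from model_set before picking the best model,
# since featureReduce below is called with best_model['XGB']
model_set = [m for m in model_set if 'RandomForest' not in m]
best_model = bestModel(model_set)  # now the XGB search result is selected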

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 23 16:27:24 2017

@author: Hin
"""


import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


#-------------------------GridSearchCV-------------------------
def gridSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc',
                         n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
    '''
    GridSearchCV: Exhaustive search over specified parameter values for an estimator.
    
    Parameters:
    -----------
    data: pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/modules/
    generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit)
    -----------
    '''
    
    X = np.array(data.drop(response, axis = 1))
    y = np.array(data[response])
    
    gs = GridSearchCV(estimator = mod, param_grid = params, scoring = evalmetric, 
                      n_jobs= n_core, cv = fold, refit = True, verbose = word, 
                      pre_dispatch = dispatch, error_score = 'raise')
    
    gs.fit(X, y)

    print('\nBest score of GridSearchCV is: {:.4f}'.format(round(gs.best_score_, 4)))
    print('\nBest parameters of GridSearchCV is: \n {}'.format(gs.best_params_))

    return(gs)

#-------------------------RandomizedSearchCV-------------------------
def randomSearchCVTraining(data, response, params, mod = None, evalmetric = 'roc_auc', 
                           n_core = -1, iter_num = 1000, fold = 5, word = 1, 
                           dispatch = '2*n_jobs'):
    '''
    RandomizedSearchCV: Randomized search on hyper parameters.
    
    Parameters:
    -----------
    data: pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    other parameters see: scikit-learn documentation(http://scikit-learn.org/stable/
    modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
    -----------
    '''
    
    X = np.array(data.drop(response, axis = 1))
    y = np.array(data[response])
    
    rs = RandomizedSearchCV(estimator = mod, param_distributions  = params, n_iter = iter_num, 
                            scoring = evalmetric, n_jobs= n_core, cv = fold, refit = True, 
                            verbose = word, pre_dispatch = dispatch, error_score = 'raise')
    
    rs.fit(X, y)

    print('\nBest score of RandomizedSearchCV is: {:.4f}'.format(round(rs.best_score_, 4)))
    print('\nBest parameters of RandomizedSearchCV is: \n {}'.format(rs.best_params_))

    return(rs)


#-------------------------Training Models in Sequence-------------------------
def trainModelSequence(data, response, classifiers, method = randomSearchCVTraining,
                       iternum = None, evalmetric = 'roc_auc', n_core = -1, fold = 5, 
                       word = 1, dispatch = '2*n_jobs'):
    '''
    A function is passed in as a parameter here (note the method argument).
    
    Training several models in sequence.
    
    Parameters:
    -----------
    data: pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    classifiers: a set of algorithms.
    
    method: the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
    
    other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
    -----------
    '''
    
    model_set = []
    
    # RandomizedSearchCV Interface
    if(iternum):
        for clf in range(len(classifiers)):
            try:
                print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))  # print each model's name, e.g. SVM
                # Train model
                model = method(data, response, params = classifiers[clf][2],  # each model's parameter set
                               mod = classifiers[clf][1], iter_num = iternum,  # mod is each base estimator, e.g. SVM()
                               evalmetric = evalmetric, n_core = n_core, fold = fold, 
                               word = word, dispatch = dispatch)  # returns the fitted search object and prints best_score_ and best_params_
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
                
            except Exception as e:
                print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                print('Error is:', e)
                print('\n', '#' * 15, 'End', '#' * 15, '\n')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
                
            else:
                pass
            
            finally:
                print('***************** {} End *****************\n'.format(classifiers[clf][0]))
                
    # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
    else:
        for clf in range(len(classifiers)):
            try:
                print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data, response, params = classifiers[clf][2], 
                               mod = classifiers[clf][1], 
                               evalmetric = evalmetric, n_core = n_core, fold = fold, 
                               word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score})
                
            except Exception as e:
                print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                print('Error is:', e)
                print('\n', '#' * 15, 'End', '#' * 15, '\n')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
                
            else:
                pass
            
            finally:
                print('***************** {} End *****************\n'.format(classifiers[clf][0]))
                
    return(model_set)


#-------------------------Select the Best Model in Model Sequence-------------------------
def bestModel(model_set, eval_logic = 'Max'):
    '''
    Select the best model from trainModelSequence.
    
    Parameters:
    -----------
    model_set: result from trainModelSequence.
    
    eval_logic: either 'Max' or 'Min', represent the logic of the evalmetric method.
    -----------
    '''
    
    if (eval_logic == 'Max'):
        bst_model, cur_model = -np.inf, -np.inf
        for model in model_set:
            if (model['Best Score'] != 'Error'):
                if (np.float64(model['Best Score']) > cur_model):
                    bst_model = model
                    cur_model = np.float64(model['Best Score'])
                else:
                    next
            else:
                next
                
    else:
        bst_model, cur_model = np.inf, np.inf
        for model in model_set:
            if (model['Best Score'].isnumeric()):
                if (np.float64(model['Best Score']) < cur_model):
                    bst_model = model
                    cur_model = np.float64(model['Best Score'])
                else:
                    next
            else:
                next
                
    return(bst_model)


#-------------------------Reduce Features from the Best Model-------------------------
def featureReduce(data, response, classifiers, model, method = randomSearchCVTraining,
                  interpolation = 'higher', iternum = None, evalmetric = 'roc_auc', 
                  n_core = -1, fold = 5, word = 1, dispatch = '2*n_jobs'):
    '''
    Reduce features to increase robustness of model
    
    Parameters:
    -----------
    data: pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    classifiers: a set of algorithms.
    
    model: result from RandomizedSearchCV / GridSearchCV.
    
    method: the method of parameter tuning, randomSearchCVTraining or gridSearchCVTraining.
    
    other parameters see: scikit-learn documentation(RandomizedSearchCV / GridSearchCV)
    -----------
    '''
    
    model_set = []
    column_names = data.drop(response, axis = 1).columns.tolist()
    
    # Get feature importances
    importances = model.best_estimator_.feature_importances_
    
    # Return the indices of the important features from max to min
    indices = np.argsort(importances)[::-1]
    
    # Get the top important features name
    features = [column_names[ele] for ele in indices]
    features = pd.DataFrame(data = features, index = range(len(features)), columns = ['feature'])
    features['seq'] = list(range(len(features)))
    
    quantile_limit = features.drop('feature', axis = 1).quantile(q = ([i/len(classifiers) for i in range(len(classifiers))] + [0.5]), 
                                                                 interpolation = interpolation)
    quantile_limit.reset_index(drop = True, inplace = True)
    quantile_limit.sort_values('seq', inplace = True)
    quantile_limit = quantile_limit.iloc[1:, :]
    
    feature_sets = [(features['feature'][:int(i)].tolist() + [response]) for i in quantile_limit.seq.values]
    
    # RandomizedSearchCV Interface
    if(iternum):
        for clf in range(len(classifiers)):
            try:
                print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data[feature_sets[clf]], response, params = classifiers[clf][2], 
                               mod = classifiers[clf][1], iter_num = iternum, 
                               evalmetric = evalmetric, n_core = n_core, fold = fold, 
                               word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 'feature' : feature_sets[clf]})
                
            except Exception as e:
                print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                print('Error is:', e)
                print('\n', '#' * 15, 'End', '#' * 15, '\n')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
                
            else:
                pass
            
            finally:
                print('***************** {} End *****************\n'.format(classifiers[clf][0]))
                
    # GridSearchCV Interface: set method = gridSearchCVTraining, iternum = None
    else:
        for clf in range(len(classifiers)):
            try:
                print('\n***************** {} Begin *****************'.format(classifiers[clf][0]))
                # Train model
                model = method(data[feature_sets[clf]], response, params = classifiers[clf][2], 
                               mod = classifiers[clf][1], 
                               evalmetric = evalmetric, n_core = n_core, fold = fold, 
                               word = word, dispatch = dispatch)
                bst_score = '{:.4f}'.format(round(model.best_score_, 4))
                model_set.append({classifiers[clf][0]: model, 'Best Score': bst_score, 'feature' : feature_sets[clf]})
                
            except Exception as e:
                print('\n', '#' * 15, 'Error Message', '#' * 15, '\n')
                print('Error is:', e)
                print('\n', '#' * 15, 'End', '#' * 15, '\n')
                model_set.append({classifiers[clf][0]: 'Error', 'Best Score': 'Error'})
                next
                
            else:
                pass
            
            finally:
                print('***************** {} End *****************\n'.format(classifiers[clf][0]))
                
    return(model_set)

III. Model evaluation

# -*- coding: utf-8 -*-
"""
Created on Sun Sep  3 12:06:42 2017

@author: Hin
"""


import math
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc  # classification metrics
from matplotlib import style, pyplot as plt
style.use('ggplot')


#-------------------------Feature Importance Graph-------------------------
def featureImpGraph(data, response, model, figname = 'Feature Importance', 
                    show = 10, figsize = (10, 8)):
    '''
    Plot the feature importance graph.
    
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    model: result from RandomizedSearchCV / GridSearchCV.
    
    figname: string, name of the graph.
    
    show: int, the number of features to be shown in the graph.
    
    figsize: tuple, the size of figure.
    -----------
    '''    
    
    column_names = data.drop(response, axis = 1).columns.tolist()
    
    # Get feature importances
    importances = model.best_estimator_.feature_importances_  # feature importances of the best estimator
    
    # Return the indices of the top several important features 
    indices = np.argsort(importances)[::-1][:show]  # argsort gives indices that sort importances ascending; [::-1] reverses them, so these are the indices of the top show features
    
    # Get the top important features name
    features = [column_names[ele] for ele in indices]
    
    # Show importance of each feature   
    fig = plt.figure(figsize = figsize)
    axes = plt.subplot2grid((1,1), (0,0))
    axes.bar(range(show), importances[indices],
           color = '#4682B4', align = 'center')
    plt.xticks(range(show), features)
    plt.xlim((-1, show))
    
    # Rotate the angle of the labels
    for label in axes.xaxis.get_ticklabels():
        label.set_rotation(90)
        
    plt.title(('Top ' + str(show) + ' Important Features'))
    plt.xlabel('Name of Variable')
    plt.ylabel('Importance')   
    plt.savefig((figname + '.jpg'))
    
    # Output the top important features
    print('Top ' + str(show) + ' Features:')
    for f in range(show): 
        print('{}. Importance of feature {} named "{}" is: {:.4f}'.format(f + 1,
              indices[f], column_names[indices[f]], importances[indices[f]]))
  


#-------------------------AUC KS Graph-------------------------
def aucKSGraph(data, response, pred_value, pos_label, model_name = 'Model', 
               figname = 'AUC KS Graph', figsize = (10, 8)):
    '''
    Plot AUC and KS graph.
    
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    pred_value: pandas Series with the predict value of model.
    
    pos_label: int, for binary classification this represents positive label.
    
    figname: string, name of the graph.
    
    model_name: string, label of the graph.
    
    figsize: tuple, the size of figure.
    -----------
    '''
    
    fpr, tpr, thresholds = roc_curve(data[response], pred_value, pos_label = pos_label)
    
    ks_value = max(tpr - fpr)
    ks_value = round(ks_value, 4)
    roc_auc_value = auc(fpr, tpr)
    roc_auc_value = round(roc_auc_value, 4)
    
    fig = plt.figure(figsize = figsize)
    plt.plot([0, 1],[0, 1], linestyle = '--', color = 'b', 
             label = "random guessing")
   
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('{} {}'.format(model_name, figname))   
    plt.plot(fpr, tpr, label = '{} (auc = {:.4f}, ks = {:.4f})'.format(model_name, 
             roc_auc_value, ks_value), color = 'r')
    plt.legend(loc = 'lower right')
    
    plt.savefig((' '.join([model_name, figname]) + '.jpg')) 


#-------------------------Model Ranking Ability-------------------------
def modelRank(data, response, model, show = 10, pos_label = 1, neg_label = 0):
    '''
    Model Ranking Ability.
    
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response.

    response: string, name of response column in data.
    
    model: result from RandomizedSearchCV / GridSearchCV.
    
    show: int, the number of sections to be shown in the table.
    
    pos_label: int, for binary classification this represents positive label.
    
    neg_label: int, for binary classification this represents negative label.
    -----------
    '''       
    
    prob = data[[response]]
    prob['prob'] = model.predict_proba(data.drop(response, axis = 1).values)[:, 1]
    prob.sort_values('prob', ascending = False, inplace = True)
    prob.reset_index(drop = True, inplace = True)
    prob['seq'] = list(range(len(prob)))
    
    # Get quantile of data
    quantile_limit = prob.seq.quantile([i/show for i in range(1, (show + 1))], 
                                            interpolation = 'higher')
    quantile_limit.reset_index(drop = True, inplace = True)
    
    rank_df = pd.DataFrame()
    
    for i in quantile_limit.index.values:
        rank_df_tmp = pd.DataFrame()
        if i == 0:
            rank_df_tmp['Bad'] = [sum(prob.loc[: quantile_limit[i], response] == pos_label)]
            rank_df_tmp['Good'] = [sum(prob.loc[: quantile_limit[i], response] == neg_label)]
            rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
            rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
            rank_df_tmp['Min_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].min()]
            rank_df_tmp['Max_Prob'] = [prob.loc[: quantile_limit[i], 'prob'].max()]
            
            rank_df = rank_df.append(rank_df_tmp)
            
        else:
            rank_df_tmp['Bad'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == pos_label)]
            rank_df_tmp['Good'] = [sum(prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], response] == neg_label)]
            rank_df_tmp['Total'] = [int(rank_df_tmp['Bad'] + rank_df_tmp['Good'])]
            rank_df_tmp['Bad_Rate'] = ['{:.4f}'.format(round(float(rank_df_tmp['Bad'] / rank_df_tmp['Total']), 4))]
            rank_df_tmp['Min_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].min()]
            rank_df_tmp['Max_Prob'] = [prob.loc[(quantile_limit[i - 1] + 1) : quantile_limit[i], 'prob'].max()]
            
            rank_df = rank_df.append(rank_df_tmp)
            
    rank_df['Cum_Bad_Num'] = rank_df.Bad.cumsum()
    rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Num'] / rank_df['Bad'].sum()
    rank_df.drop('Cum_Bad_Num', axis = 1, inplace = True)
    rank_df['Cum_Bad_Pct'] = rank_df['Cum_Bad_Pct'].map(lambda x: '{:.4f}'.format(round(x, 4)))
     
    total = pd.DataFrame({'Bad' : [rank_df['Bad'].sum()],
                          'Good' : [rank_df['Good'].sum()],
                          'Total' : [rank_df['Total'].sum()],
                          'Bad_Rate' : ['{:.4f}'.format(round((rank_df['Bad'].sum() / rank_df['Total'].sum()), 4))],
                          'Min_Prob' : [rank_df['Min_Prob'].min()],
                          'Max_Prob' : [rank_df['Max_Prob'].max()],
                          'Cum_Bad_Pct' : ['1.0000']}) 
    
    rank_df = rank_df.append(total)
    rank_df.reset_index(drop = True, inplace = True)
    
    rank_df['Min_Prob'] = rank_df['Min_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))
    rank_df['Max_Prob'] = rank_df['Max_Prob'].map(lambda x: '{:.4f}'.format(round(x, 4)))
    
    # Output ordered columns
    return(rank_df[['Min_Prob', 'Max_Prob', 'Bad', 'Good',
                    'Total', 'Bad_Rate', 'Cum_Bad_Pct']])
    
            
#-------------------------Population Stability Index(PSI)-------------------------
def PSI(data_a, data_b, response, model, show = 10):
    '''
    Calculate Population Stability Index(PSI).
    
    Parameters:
    -----------
    data_a & data_b: pandas DataFrame, have same columns and only have numeric columns.

    response: string, name of response column in data.
    
    model: result from RandomizedSearchCV / GridSearchCV.
    
    show: int, the number of sections to be split in the table.
    -----------
    '''
    
    # Get probability from base data and predict data
    prob_base = pd.Series(model.predict_proba(data_a.drop(response, axis = 1).values)[:, 1])
    quantile_limit = prob_base.quantile([i/show for i in range(1, (show + 1))], 
                                         interpolation = 'higher')
    quantile_limit.reset_index(drop = True, inplace = True)
    
    prob_pred = pd.Series(model.predict_proba(data_b.drop(response, axis = 1).values)[:, 1])
    
    # base and predict list
    base_list = []
    pred_list = []
    
    # Orignal  
    for i in quantile_limit.index.values:
        if i == 0:
            base_list.append((sum(prob_base <= quantile_limit[i]) / len(prob_base)))
            pred_list.append((sum(prob_pred <= quantile_limit[i]) / len(prob_pred)))
            
        else:
            base_list.append((sum((prob_base > quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
            pred_list.append((sum((prob_pred > quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))
            
    # Deal with 0 in base_list & pred_list    
    psi = '{:.4f}%'.format((round(sum([np.inf if (y == 0 or t == 0) else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]), 
                            8) * 100))
    
    print('\n{} sections PSI is: {}'.format(show, psi))
    
    if float(psi[:(len(psi) - 1)]) > 10:
        print('''
              ************************************
              Warning: Beware of the large PSI !!!
              ************************************
              ''')
        
    print('\n')
        
#**************************************************    
#    # For cooperate Xu Min
#    for i in quantile_limit.index.values:
#        if i == 0:       
#            base_list.append((sum(prob_base < quantile_limit[i]) / len(prob_base)))
#            pred_list.append((sum(prob_pred < quantile_limit[i]) / len(prob_pred)))
#            
#        elif i == 9:       
#            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base <= quantile_limit[i])) / len(prob_base)))
#            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred <= quantile_limit[i])) / len(prob_pred)))
#            
#        else:
#            base_list.append((sum((prob_base >= quantile_limit[i - 1]) & (prob_base < quantile_limit[i])) / len(prob_base)))
#            pred_list.append((sum((prob_pred >= quantile_limit[i - 1]) & (prob_pred < quantile_limit[i])) / len(prob_pred)))
#        
#    psi = '{:.4f}%'.format((round(sum([0 if y == 0 or t == 0 else ((t - y) * math.log(t / y)) for t, y in zip(pred_list, base_list)]), 
#                            8) * 100))
#**************************************************  
       
    return(psi)    
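
To make the calculation above concrete, a tiny worked example with made-up bucket shares (same formula as the list comprehension above):

# Share of base scores vs. new scores falling into each of four buckets (made-up numbers)
base_list = [0.10, 0.20, 0.30, 0.40]
pred_list = [0.12, 0.18, 0.35, 0.35]

# PSI = sum over buckets of (pred - base) * ln(pred / base)
psi = sum((t - y) * math.log(t / y) for t, y in zip(pred_list, base_list))
print('{:.4f}%'.format(psi * 100))  # small values indicate a stable score distribution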

IV. Example (Template.py)

# -*- coding: utf-8 -*-
"""
Created on Tue Nov 21 20:57:44 2017

@author: Hin
"""


%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from scipy.stats import randint as sp_randint
from Data_Processing import trainTestSplitV2, varThreshold, replaceOutlierNScale, featureSelectFromModel
from Multiple_Model_Selection import randomSearchCVTraining, gridSearchCVTraining, trainModelSequence, bestModel, featureReduce
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from Model_Evaluation import featureImpGraph, aucKSGraph, modelRank, PSI
import joblib
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline

#-------------------------Set Label-------------------------
# Read csv with chinese characters and rename y as flag
dataset = pd.read_csv('ZZZ_test_purpose.csv', header = 0, encoding = 'gb18030')
dataset.rename(columns = {'def30_dup': 'flag'}, inplace = True)

# Get number of rows and columns of data
print("Number of Rows: ", dataset.shape[0])
print("Number of Columns: ", dataset.shape[1])

# Show missing values for each column
dataset.isnull().sum()
print('The Number of Missing Values: ', dataset.isnull().sum().sum())

# Keep only the numeric columns of the data
# For test purposes, only numeric data is used in the rest of the pipeline
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataset = dataset.select_dtypes(include = numerics)


#-------------------------Missing Imputation-------------------------
# Use a constant value for missing imputation
dataset.replace([np.nan, np.inf, -np.inf], -1, inplace = True)


#-------------------------Train Test Split-------------------------
train, test = trainTestSplitV2(dataset, 'flag')


#-------------------------Feature Engineering and Feature Selection-------------------------
# Deal with categorical variables (one-hot encoding): pd.get_dummies(df)  

# Removes all 0 variance features
elmi_col = varThreshold(dataset, train)
train.drop(elmi_col, axis = 1, inplace = True)
test.drop(elmi_col, axis = 1, inplace = True)

# Obtain all the important features based on certain threshold
imp_col = featureSelectFromModel(train, 'flag', figname = 'Top 10 Important Features', 
                                n_tree = 500, n_core = -1, rdm_state = None, thd = 'median', 
                                word = 1, show = 10, figsize = (12, 10))
train = train[(imp_col + ['flag'])]
test = test[(imp_col + ['flag'])]


#-------------------------Replacing Outliers and Scaling-------------------------
# Change test set first owing to the logic of the replaceOutlierNScale function

# Ignore NaN when scaling the data
train.replace(-1, np.nan, inplace = True)
test.replace(-1, np.nan, inplace = True)

quantile_limit, test = replaceOutlierNScale(train, test, 'flag', low = 0.02, high = 0.98, scale = True)
quantile_limit, train = replaceOutlierNScale(train, train, 'flag', low = 0.02, high = 0.98, scale = True)

# Fill nan data
train.replace([np.nan, np.inf, -np.inf], -1, inplace = True)
test.replace([np.nan, np.inf, -np.inf], -1, inplace = True)



#-------------------------Model Selection-------------------------
# Use same interface for several models
# classifiers consists of name, algorithm, parameters set
classifiers = []

## Logistic regression
#logreg_param = {'C': list(np.power(10.0, np.arange(-10, 10))),'penalty': ['l1','l2']}
#classifiers.append(['Logistic Regression', LogisticRegression(), logreg_param])

# SVM
svm_para = {'kernel':['linear','rbf'],
            'C': list(np.power(10.0, np.arange(-10, 3))),
            'gamma': list(np.logspace(-4,0,5)) + ['auto']}
classifiers.append(['SVM', SVC(probability = True), svm_para])

# Random Forest
rf_param = {'n_estimators': list(np.arange(600, 1100, 100)), 
            'max_depth': [3, 10, None],
            'max_features': [i/10 for i in range(1, 10)],
            'min_samples_split': sp_randint(2, 10),
            'min_samples_leaf': sp_randint(1, 10)}
classifiers.append(['RandomForest', RandomForestClassifier(), rf_param])


# XGBoost
xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
             'n_estimators': list(range(100, 1100, 100)), # Number of trees
             'max_depth': list(range(1, 10, 2)), # Max depth of trees
             'gamma': list(np.arange(0, 0.5, 0.05)), # Minimum loss reduction required to make a further partition on a leaf node of the tree
             'subsample': [i/10 for i in range(5, 11)], # subsample ratio of the training data, row-wise
             'colsample_bytree': [i/10 for i in range(3, 11)], # subsample ratio of columns when constructing each tree
             'colsample_bylevel': [i/10 for i in range(1, 11)], # subsample ratio of columns for each split
             'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))), # L1 regularization term on weights
             'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))), # L2 regularization term on weights
             'min_child_weight': sp_randint(1, 6),  # Defines the minimum sum of weights of all observations required in a child
             'max_delta_step': sp_randint(0, 11), # Maximum delta step we allow each tree's weight estimation to be (0 means no constraint)
             'random_state': sp_randint(0, 1000)} 
xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
classifiers.append(['XGB', xgb, xgb_param])

model_set = trainModelSequence(train, 'flag', classifiers, method = randomSearchCVTraining,
                               iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)

# Get best CV estimator
best_model = bestModel(model_set)


#-------------------------Feature Reduction-------------------------
# Reduce features from the best model above
# classifiers consists of name, algorithm, parameters set
classifiers = []


# XGBoost1
xgb_param = {'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
             'n_estimators': list(range(100, 1100, 100)), # Number of trees
             'max_depth': list(range(1, 10, 2)), # Max depth of trees
             'gamma': list(np.arange(0, 0.5, 0.05)), # Minimum loss reduction required to make a further partition on a leaf node of the tree
             'subsample': [i/10 for i in range(5, 11)], # subsample ratio of the training data, row-wise
             'colsample_bytree': [i/10 for i in range(3, 11)], # subsample ratio of columns when constructing each tree
             'colsample_bylevel': [i/10 for i in range(1, 11)], # subsample ratio of columns for each split
             'reg_alpha': ([0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5] + list(np.arange(1, 10, 1)) + list(np.arange(10, 110, 10))), # L1 regularization term on weights
             'reg_lambda' : (list(np.arange(1, 10, 1)) + list(np.arange(10, 60, 10))), # L2 regularization term on weights
             'min_child_weight': sp_randint(1, 6),  # Defines the minimum sum of weights of all observations required in a child
             'max_delta_step': sp_randint(0, 11), # Maximum delta step we allow each tree's weight estimation to be (0 means no constraint)
             'random_state': sp_randint(0, 1000)} 
xgb = XGBClassifier(objective = 'binary:logistic', missing = None)
classifiers.append(['XGB1', xgb, xgb_param])

# XGBoost2
classifiers.append(['XGB2', xgb, xgb_param])

# XGBoost3
classifiers.append(['XGB3', xgb, xgb_param])

# Pick one model from xgb_models that has good performance and fewer features
xgb_models = featureReduce(train, 'flag', classifiers, best_model['XGB'], method = randomSearchCVTraining,
                           iternum = 10, evalmetric = 'roc_auc', n_core = 4, fold = 5, word = 1)



#-------------------------Create Graph, Ranking and PSI-------------------------
# Feature Importance 
featureImpGraph(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], 
                figname = 'XGB2 Feature Importance', 
                show = 10, figsize = (12, 10))

# AUC and KS Graph
# For training set
aucKSGraph(train, 'flag', xgb_models[1]['XGB2'].predict_proba(np.array(train[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
           pos_label = 1, model_name = 'Training', figsize = (12, 10))

# For testing set
aucKSGraph(test, 'flag', xgb_models[1]['XGB2'].predict_proba(np.array(test[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1],
           pos_label = 1, model_name = 'Testing', figsize = (12, 10))

# Ranking
rank_train = modelRank(train[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)
rank_test = modelRank(test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'], show = 20)

# PSI 
PSI(train[xgb_models[1]['feature']], test[xgb_models[1]['feature']], 'flag', xgb_models[1]['XGB2'])


#-------------------------Save Model as PKL-------------------------
joblib.dump(xgb_models[1]['XGB2'], 'XGB2oost_xgb_models.pkl', compress = 3)
# XGB2_best = joblib.load("XGB2oost_XGB2_models.pkl") 

#-------------------------Save Model as PMML-------------------------
# XGB to PMML 
# xgb_models[1]['XGB2'].best_params_: Get best parameters from model
# xgb_models[1]['XGB2'].best_estimator_: Estimator that was chosen by the search
xgb_pipeline = PMMLPipeline([  
    ("mapper", DataFrameMapper([(i, None) for i in xgb_models[1]['feature'][:(len(xgb_models[1]['feature']) - 1)]])),    
    ("classifier", xgb_models[1]['XGB2'].best_estimator_)])    

# xgb_pipeline is a model which can also be used to predict    
xgb_pipeline.fit(train[xgb_models[1]['feature']].drop('flag', axis = 1), train[xgb_models[1]['feature']].flag)  

# PMML Transfer
sklearn2pmml(xgb_pipeline, "xgb.pmml", with_repr = True) 
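
As a usage note (a sketch, assuming the objects created above are still in memory), the saved PKL can be reloaded and used to score the test set in the same way as before:

# Reload the pickled search object and score the test set with it
clf = joblib.load('XGB2oost_xgb_models.pkl')
test_scores = clf.predict_proba(np.array(test[xgb_models[1]['feature']].drop('flag', axis = 1)))[:, 1]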
Original article: https://www.cnblogs.com/cgmcoding/p/13590573.html