机器学习类别不平衡处理之欠采样（undersampling）

类别不平衡就是指分类任务中不同类别的训练样例数目差别很大的情况

常用的做法有三种，分别是1.欠采样， 2.过采样， 3.阈值移动

由于这几天做的project的target为正值的概率不到4%，且数据量足够大，所以我采用了欠采样：

欠采样，即去除一些反例使得正、反例数目接近，然后再进行学习，基本的算法如下：

def undersampling(train, desired_apriori):

    # Get the indices per target value
    idx_0 = train[train.target == 0].index
    idx_1 = train[train.target == 1].index
    # Get original number of records per target value
    nb_0 = len(train.loc[idx_0])
    nb_1 = len(train.loc[idx_1])
    # Calculate the undersampling rate and resulting number of records with target=0
    undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
    undersampled_nb_0 = int(undersampling_rate*nb_0)
    print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
    print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
    # Randomly select records with target=0 to get at the desired a priori
    undersampled_idx = shuffle(idx_0, n_samples=undersampled_nb_0)
    # Construct list with remaining indices
    idx_list = list(undersampled_idx) + list(idx_1)
    # Return undersample data frame
    train = train.loc[idx_list].reset_index(drop=True)

    return train

因为对应具体的project，所以里面欠采样的为反例，如果要使用的话需要做一些改动。

欠采样法若随机丢弃反例，可能会丢失一些重要信息。为此，周志华实验室提出了欠采样的算法EasyEnsemble：利用集成学习机制，将反例划分为若干个集合供不同学习器使用，这样对每个学习器来看都进行了欠采样，但在全局来看却不会丢失重要信息。其实这个方法可以再基本欠采样方法上进行些许改动即可：

def easyensemble(df, desired_apriori, n_subsets=10):
    train_resample = []
    for _ in range(n_subsets):
        sel_train = undersampling(df, desired_apriori)
        train_resample.append(sel_train)
    return train_resample

仔细来看，下图是原始论文Exploratory Undersampling for Class-Imbalance Learning里的算法介绍：

PS: 对于类别不平衡的时候采用CV进行交叉验证时，由于分类问题在目标分布上表现出很大的不平衡性。如果用sklearn库中的函数进行交叉验证的话，建议采用如StratifiedKFold 和 StratifiedShuffleSplit中实现的分层抽样方法，确保相对的类别概率在每个训练和验证折叠中大致保留。

Reference: