Imbalanced Learning Methods: A Summary of Theory and Practice

Original article: http://blog.csdn.net/hero_fantao/article/details/35784773

Imbalanced learning methods

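Class imbalance in machine learning falls roughly into two cases:

(1) The class ratio is skewed, but every class still has plenty of samples;

(2) One class has very few samples.

The second case is not the focus here. When samples are scarce they cover only a small part of the sample space, and if there are many features such data is of little value for model learning. For this case the only good remedy is to collect as many minority-class samples as possible to cover the sample space.

The rest of this post discusses the first case.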

I. Sampling methods

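1. Random oversampling:

When the classes are imbalanced, randomly resample the minority class (with replacement) until the classes are balanced. This method merely copies minority samples, so it tends to overfit. In a decision tree, for example, a leaf node may end up containing nothing but copies of a single sample. The model then generalizes poorly: training accuracy may go up, but predictions on unseen test samples can be very bad.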
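As a rough illustration of the copying described above, here is a minimal NumPy sketch. The function name `random_oversample` and the toy data are made up for this example, and it assumes a binary problem where every label other than `minority_label` is the majority class.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0 or len(minority_idx) == 0:
        return X, y  # already balanced, or nothing to copy
    # Sampling WITH replacement: every added row is an exact copy of an existing row.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Toy data: 2 minority rows, 8 majority rows.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_over, y_over = random_oversample(X_demo, y_demo)
```

After the call both classes have 8 rows, but the minority class still contains only 2 distinct points, which is exactly why a tree can end up with leaves full of copies.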

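2. Random undersampling:

When the classes are imbalanced, randomly undersample the majority class, i.e. keep only part of the majority samples, until the classes are balanced. The problem is that discarding samples may remove important data from the sample space and lower accuracy.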
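A minimal sketch of random undersampling, under the same assumptions as before (binary labels, NumPy arrays). The name `random_undersample` and the toy data are hypothetical.

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(minority_idx), len(majority_idx))
    # Sampling WITHOUT replacement: the majority rows not drawn are discarded.
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Toy data: 2 minority rows, 8 majority rows.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_under, y_under = random_undersample(X_toy, y_toy)
```

Here 6 of the 8 majority rows are thrown away, which shows how much of the sample space can be lost when the imbalance is strong.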

3. Synthetic Sampling with Data Generation

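Generate approximate new samples for the minority class. For each minority sample, run KNN to find its K nearest neighbours, then generate new samples from the difference between the current sample and those neighbours, i.e. at a random point on the segment joining the current sample and a chosen neighbour.

This goes beyond simple duplication: by creating new minority samples it enriches the minority sample space and compensates for its sparse coverage. The drawback is that it applies the same KNN procedure to every minority sample. For minority samples that are already clearly separated from the majority class, generating extra samples adds little value.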
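The interpolation idea above is commonly known as SMOTE. The following is only a sketch of it, not a reference implementation: `smote_sketch` is a made-up name, neighbours are found by brute force, and `X_min` is assumed to be a 2-D array holding only minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    if n_new <= 0 or k < 1:
        return np.empty((0, X_min.shape[1]))
    # Pairwise squared distances between minority samples (brute force).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # starting sample for each new row
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points on the corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(corners, n_new=10, k=2)
```

Note that every minority sample is treated the same way: the starting sample is drawn uniformly, regardless of how close it is to the majority class.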

4. Adaptive Synthetic Sampling

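Adaptive Synthetic Sampling is a refinement: it tries to generate more synthetic samples for the minority samples that lie close to majority samples.

The method, as implemented in the source code appended at the end of this post, is as follows:

(1) Compute the total number of samples to generate, G = (ml - ms) * beta, where ml and ms are the majority and minority class sizes and beta tunes how many samples are generated.

(2) For each minority sample, find its K nearest neighbours in the whole dataset and compute r, the fraction of those neighbours that belong to the majority class.

(3) Normalize the r values so that they sum to 1.

(4) For each minority sample, generate round(r * G) new samples, each at a random point on the segment between the sample and one of its K nearest minority neighbours.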
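A small worked example may help make the adaptive allocation concrete. The helper `adasyn_allocation` and the numbers are invented for illustration; it only mirrors the arithmetic of the appended source (majority share among K neighbours, normalization, then scaling by G) and does not generate any samples.

```python
import numpy as np

def adasyn_allocation(majority_neighbour_counts, K, n_majority, n_minority, beta=1.0):
    """Return how many synthetic samples each minority sample should receive."""
    # r_i: share of majority samples among the K nearest neighbours of sample i
    r = np.asarray(majority_neighbour_counts, dtype=float) / K
    G = (n_majority - n_minority) * beta  # total number of samples to generate
    if r.sum() == 0:
        return np.zeros(len(r), dtype=int)  # no minority sample is near the majority
    r_hat = r / r.sum()  # normalized weights, summing to 1
    return np.round(r_hat * G).astype(int)

# Three minority samples with 0, 1 and 4 majority neighbours out of K = 5.
counts = adasyn_allocation([0, 1, 4], K=5, n_majority=13, n_minority=3, beta=0.5)
print(counts)  # [0 1 4]
```

The sample with no majority neighbours gets nothing, while the one sitting almost entirely among majority samples receives most of the G = 5 new samples.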

II. Cost-sensitive learning

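The first approach works on the sample side: make the samples as balanced as possible, then train a model. The other approach assigns different costs to misclassifying different samples, for example a higher cost for misclassifying a minority sample. My personal impression is that this works about as well as the resampling in Part I: you sacrifice one thing to gain another.

A better approach, in my view, is the following. When positive and negative samples are imbalanced, take a subset of the majority samples together with all of the minority samples, so that the two are roughly balanced, and train a model. Repeat this to obtain several models, then let them vote to get the final prediction. You can follow AdaBoost and weight each model, but in practice plain voting already gives quite good results.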
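The voting scheme described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names are hypothetical, labels are taken to be 0/1, and a nearest-centroid rule stands in for whatever real classifier you would train on each balanced subset.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in base learner: one centroid per class (row 0 = class 0, row 1 = class 1)."""
    return np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])

def predict_centroids(centroids, X):
    """Predict the class whose centroid is closest."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def balanced_voting_ensemble(X, y, n_models=5, minority_label=1, seed=0):
    """Train one model per balanced subset: all minority rows + a random majority subset."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    models = []
    for _ in range(n_models):
        subset = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, subset])
        models.append(fit_centroids(X[idx], y[idx]))
    return models

def vote(models, X):
    """Unweighted majority vote over the models (ties go to class 1)."""
    votes = np.stack([predict_centroids(m, X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy data: 200 majority points around (0, 0), 20 minority points around (4, 4).
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (20, 2))])
y_train = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])
models = balanced_voting_ensemble(X_train, y_train)
pred = vote(models, np.array([[0.0, 0.0], [4.0, 4.0]]))
```

Each model sees a different slice of the majority class, so the ensemble as a whole uses far more majority data than a single undersampled model. An odd `n_models` avoids ties in the vote.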

References:

[1] He H, Garcia E A. Learning from imbalanced data[J]. Knowledge and Data Engineering, IEEE Transactions on, 2009, 21(9): 1263-1284.

[2] https://github.com/fmfn/UnbalancedDataset (a module shared by @phunter_lau on 2014/12/07)

Source code for Adaptive Synthetic Sampling:


'''
Created on 2014/03/09
@author: dylan
'''
import random

import numpy as np
from sklearn.neighbors import NearestNeighbors

def get_class_count(y, minorityclasslabel=1):
    # Count the samples of each class; labels are assumed to be binary (0/1).
    minorityclasslabel_count = len(np.where(y == minorityclasslabel)[0])
    maxclasslabel_count = len(np.where(y == (1 - minorityclasslabel))[0])
    return maxclasslabel_count, minorityclasslabel_count

# @param ml: the number of samples in the majority class
# @param ms: the number of samples in the minority class
# @param beta: tuning factor; 1.0 generates enough samples to fully balance the classes
# @return: the G value, the total number of synthetic samples to generate
def getG(ml, ms, beta):
    return (ml - ms) * beta

# @param X: the datapoints, e.g. [f1, f2, ..., fn]
# @param y: the class labels (0/1), e.g. [0, 1, 1, 1, 0, ..., Cn]
# @param indicesMinority: indices of the minority samples in X
# @param minorityclasslabel: the minority class label
# @param K: the number of neighbours for KNN
# @return: list of normalized r values (sums to 1)
def getRis(X, y, indicesMinority, minorityclasslabel, K):
    X = np.asarray(X)
    y = np.asarray(y)
    Xmin = X[indicesMinority]
    neigh = NearestNeighbors(n_neighbors=K)
    neigh.fit(X)
    rlist = [0.0] * len(Xmin)
    for i in range(len(Xmin)):
        # kneighbors expects a 2-D query. X contains the query point itself,
        # so the point is normally one of its own K neighbours.
        indices = neigh.kneighbors(Xmin[i].reshape(1, -1), n_neighbors=K,
                                   return_distance=False)[0]
        # r_i: fraction of the K neighbours that belong to the majority class
        rlist[i] = np.sum(y[indices] == (1 - minorityclasslabel)) / float(K)
    normConst = sum(rlist)
    if normConst == 0:
        # No minority sample has a majority neighbour; avoid dividing by zero.
        return rlist
    return [r / normConst for r in rlist]

def get_indicesMinority(y, minorityclasslabel=1):
    # Binarize the labels: 1 stays 1, everything else becomes 0.
    y_new = np.where(np.asarray(y) == 1, 1, 0)
    indicesMinority = np.where(y_new == minorityclasslabel)[0]
    return indicesMinority, y_new
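
def generateSamples(X, y, minorityclasslabel=1, K=5, beta=0.3):
    X = np.asarray(X)
    y = np.asarray(y)
    syntheticdata_X = []
    syntheticdata_y = []
    # Pass the minority label through instead of relying on the default.
    indicesMinority, y_new = get_indicesMinority(y, minorityclasslabel)
    ymin = y[indicesMinority]
    Xmin = X[indicesMinority]
    rlist = getRis(X, y_new, indicesMinority, minorityclasslabel, K)
    ml, ms = get_class_count(y_new, minorityclasslabel)
    G = getG(ml, ms, beta=beta)
    # Neighbours used for interpolation are searched among minority samples only.
    n_neighbors = min(K, len(Xmin))
    neigh = NearestNeighbors(n_neighbors=n_neighbors)
    neigh.fit(Xmin)
    for k in range(len(ymin)):
        g = int(np.round(rlist[k] * G))  # number of samples to generate for this point
        neighb_indx = neigh.kneighbors(Xmin[k].reshape(1, -1), n_neighbors=n_neighbors,
                                       return_distance=False)[0]
        for _ in range(g):
            ind = random.choice(neighb_indx)
            # New point on the segment between Xmin[k] and the chosen neighbour
            s = Xmin[k] + (Xmin[ind] - Xmin[k]) * random.random()
            syntheticdata_X.append(s)
            syntheticdata_y.append(ymin[k])
    print('asyn, raw X size:', X.shape)
    if syntheticdata_X:  # vstack would fail on an empty list
        X = np.vstack((X, np.asarray(syntheticdata_X)))
        y = np.hstack((y, syntheticdata_y))
    print('asyn, post X size:', X.shape)
    return X, y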