Ensemble Learning: Overview Notes

Ensemble Learning

**Ensemble methods combine models in roughly four ways: bagging / boosting / voting / stacking**

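There are many machine learning algorithms, and each approaches a problem slightly differently, so different algorithms may give different results for the same problem. Whose result should we take as the final answer? We can pool several algorithms, let each of them predict on the same problem, and let the majority decide. That is the idea behind ensemble learning.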
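
To make the "majority decides" idea concrete, here is a minimal sketch of hard voting in plain Python. The three lists of predictions are made up for illustration; they are not outputs of the models used later in these notes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models for one sample."""
    # most_common(1) returns [(label, count)] for the top label;
    # equal counts are resolved in favour of the label seen first.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions of three models for four samples
model_predictions = [
    [1, 0, 1, 1],  # model A
    [1, 1, 0, 1],  # model B
    [0, 0, 1, 1],  # model C
]

# zip(*...) regroups the predictions per sample instead of per model
ensemble = [majority_vote(votes) for votes in zip(*model_predictions)]
print(ensemble)  # [1, 0, 1, 1]
```

Each model is wrong on a different sample here, so the vote recovers the pattern that two out of three models agree on.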

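# Import the models and the metric used below
# (X_train, X_test, y_train, y_test are assumed to be already defined)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set seed for reproducibility
SEED = 1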

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
  
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
  
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
  
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))


<script.py> output:
    Logistic Regression : 0.747
    K Nearest Neighbours : 0.724
    Classification Tree : 0.730

VotingClassifier

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train,y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

<script.py> output:
    Voting Classifier: 0.753

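Pleasantly, the Voting Classifier reaches a test accuracy of 0.753, higher than each individual model before ensembling (Logistic Regression: 0.747, K Nearest Neighbours: 0.724, Classification Tree: 0.730). That is the appeal of ensembling.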

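Can I look at it this way? In my work on intelligent optimization algorithms I combine two algorithms pairwise, for example the flower pollination algorithm with particle swarm optimization. That is not ensemble learning in the full sense, only a partial combination, but it does improve convergence.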

bagging

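Characteristics

Parallel ensemble: each model is built independently.

Aims to reduce variance rather than bias, so it helps most when the base models tend to overfit.

Suited to high-variance, low-bias models (complex models).

A tree-based example is the random forest, which grows fully developed trees (note that RF modifies the tree-growing procedure to reduce the correlation between trees).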

Derivation

Input
Training set: \(D=\left\{\left(\boldsymbol{x}_{1}, y_{1}\right),\left(\boldsymbol{x}_{2}, y_{2}\right), \ldots,\left(\boldsymbol{x}_{m}, y_{m}\right)\right\}\)
Base learning algorithm: \(\mathfrak{L}\)
Number of training rounds: \(T\)
Process
for \(t=1,2, \dots, T\) do
    \(h_{t}=\mathfrak{L}\left(D, \mathcal{D}_{bs}\right)\), where \(\mathcal{D}_{bs}\) is the bootstrap sampling distribution
end for
Output
\(H(\boldsymbol{x})=\underset{y \in \mathcal{Y}}{\arg\max} \sum_{t=1}^{T} \mathbb{I}\left(h_{t}(\boldsymbol{x})=y\right)\)
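
The procedure above can be sketched from scratch in a few lines. This is only an illustration under stated assumptions: the base learner `fit_stump` (a one-feature threshold rule), the helper names, and the toy data are all made up here; they stand in for \(\mathfrak{L}\), and are not how scikit-learn implements bagging.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Hypothetical base learner: the 1-D threshold rule with the fewest errors."""
    best = None
    for threshold, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging_fit(data, n_rounds, rng):
    """Train n_rounds base learners, each on its own bootstrap sample of data."""
    m = len(data)
    models = []
    for _ in range(n_rounds):
        # Draw m points uniformly with replacement (the bootstrap sample)
        bootstrap = [data[rng.randrange(m)] for _ in range(m)]
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """H(x): the label that collects the most votes from the base learners."""
    return Counter(h(x) for h in models).most_common(1)[0][0]

# Toy data (made up): the label is 1 when x > 5
data = [(x, int(x > 5)) for x in range(11)]
models = bagging_fit(data, n_rounds=25, rng=random.Random(1))
print([bagging_predict(models, x) for x in (0, 2, 8, 10)])
```

The loop in `bagging_fit` corresponds to the `for` loop of the algorithm, and `bagging_predict` to the arg max over vote counts in the output line.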


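Implementation notes

In scikit-learn,
the parameters max_samples and max_features control the size of the subsets (in terms of samples and features);
the parameters bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement.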
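
As a rough illustration of what these four parameters mean, the sketch below draws row and column indices in plain Python. The function `draw_indices` is hypothetical and only approximates the behaviour described above; it is not scikit-learn's internal code, and the rounding of fractional sizes is an assumption.

```python
import random

def draw_indices(n_population, max_items, with_replacement, rng):
    """Draw indices in the spirit of max_samples / max_features.

    max_items: an int is an absolute count, a float is a fraction
    of n_population (rounded down here, which is an assumption).
    """
    if isinstance(max_items, float):
        n_draw = max(1, int(max_items * n_population))
    else:
        n_draw = max_items
    if with_replacement:
        # Like bootstrap=True: the same index may appear several times
        return [rng.randrange(n_population) for _ in range(n_draw)]
    # Like bootstrap=False: every index appears at most once
    return rng.sample(range(n_population), n_draw)

rng = random.Random(0)
n_samples, n_features = 10, 4

# Analogous to max_samples=0.8, bootstrap=True
rows = draw_indices(n_samples, 0.8, True, rng)
# Analogous to max_features=2, bootstrap_features=False
cols = draw_indices(n_features, 2, False, rng)
print(rows, cols)
```

Each base estimator would then be trained on the selected rows, restricted to the selected columns.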

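Bagging, short for bootstrap aggregating, is a technique that repeatedly samples from the data (with replacement) according to a uniform probability distribution.
A base classifier is trained on each bootstrap sample; the trained classifiers then vote, and a test sample is assigned to the class with the most votes.
Each bootstrap sample is the same size as the original data.
Because sampling is with replacement, some samples may appear several times in the same training set while others are left out.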
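
How many samples are left out can be worked out directly: a given sample is missed by all m draws with probability (1 - 1/m)^m, which tends to 1/e, about 36.8%, as m grows. The short check below compares that formula with one simulated bootstrap sample; the value of m and the seed are arbitrary choices for illustration.

```python
import math
import random

def left_out_fraction(m, rng):
    """Fraction of the m original samples missing from one bootstrap sample."""
    drawn = {rng.randrange(m) for _ in range(m)}  # distinct indices drawn
    return 1 - len(drawn) / m

m = 10_000
theory = (1 - 1 / m) ** m
simulated = left_out_fraction(m, random.Random(0))
print(f"theory: {theory:.4f}, limit 1/e: {1 / math.e:.4f}, "
      f"simulated: {simulated:.4f}")
```

So each base classifier sees roughly 63% of the distinct training samples, and the remaining third or so are the out-of-bag samples used for OOB evaluation.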

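Evaluation

Bagging improves the generalization error by reducing the variance of the base classifiers.
Its performance depends on the stability of the base classifier. If the base classifier is unstable, bagging helps reduce the error caused by random fluctuations in the training data; if it is stable, the ensemble's error comes mainly from the bias of the base classifier.
Because every sample has the same probability of being selected, bagging does not focus on any particular instance in the training set.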
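
The variance argument can be made precise for the simpler case of averaging. Assume, purely for illustration, that the T base predictions \(h_t(\boldsymbol{x})\) each have variance \(\sigma^2\) and pairwise correlation \(\rho\) (these symbols are not part of the notes above). Then:

```latex
\operatorname{Var}\left(\frac{1}{T}\sum_{t=1}^{T} h_{t}(\boldsymbol{x})\right)
  = \frac{1}{T^{2}}\left(T\sigma^{2} + T(T-1)\rho\sigma^{2}\right)
  = \rho\sigma^{2} + \frac{1-\rho}{T}\sigma^{2}
```

The second term shrinks as T grows, while the first does not, so the benefit is capped by how correlated the base models are. The bias is unchanged by averaging, which matches the remark that a stable base classifier leaves mostly bias error. For majority voting the same intuition applies, though this exact formula holds only for averages.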

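BaggingClassifier parameters

    base_estimator (renamed estimator in scikit-learn 1.2): object or None. None means the default, a DecisionTreeClassifier; an object specifies the base estimator to use.

    n_estimators: int, optional (default=10). The number of base estimators in the ensemble.

    max_samples: int or float, optional (default=1.0). How many samples to draw from X_train to train each base estimator. An int is a count, a float is a fraction.

    max_features: int or float, optional (default=1.0). How many features to draw from X_train to train each base estimator. An int is a count, a float is a fraction.

    bootstrap: boolean, optional (default=True). Whether samples are drawn with replacement (True) or without (False).

    bootstrap_features: boolean, optional (default=False). Whether features are drawn with replacement (True) or without (False).

    oob_score: bool. Whether to use out-of-bag samples to estimate the generalization error.

    warm_start: bool, optional (default=False). When True, reuse the result of the previous call to fit and add more estimators to the ensemble; otherwise fit a whole new ensemble.

    n_jobs: int, optional (default=1). The number of jobs to run in parallel for fit and predict; -1 uses all processors.

    random_state: int, RandomState instance or None, optional (default=None). If an int, it is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

    verbose: int, optional (default=0). Controls the verbosity when fitting and predicting.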

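Attributes:

    estimators_: list of estimators. The collection of fitted sub-estimators.

    estimators_samples_: list of arrays. The subset of drawn samples for each base estimator.

    estimators_features_: list of arrays. The subset of drawn features for each base estimator.

    oob_score_: float. The score of the training dataset obtained using an out-of-bag estimate.

    oob_prediction_: array of shape = [n_samples]. Predictions computed with the out-of-bag estimate on the training set (this attribute belongs to BaggingRegressor; BaggingClassifier exposes oob_decision_function_ instead). If n_estimators is small, a data point may never be left out during sampling; in that case the array may contain NaN.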
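
To see why a small number of estimators can leave NaN entries, the sketch below estimates how likely it is that a given sample ends up in every bootstrap sample and therefore never gets an out-of-bag prediction. It assumes independent bootstrap samples of full size (max_samples=1.0, bootstrap=True); the training-set size of 500 is a made-up figure.

```python
def prob_never_out_of_bag(m, n_estimators):
    """P(a given sample is drawn into every one of the bootstrap samples)."""
    p_in_bag = 1 - (1 - 1 / m) ** m   # drawn at least once in one bootstrap
    return p_in_bag ** n_estimators   # ... in all of them, independently

m = 500  # hypothetical training-set size
for n_estimators in (1, 5, 10, 50):
    p = prob_never_out_of_bag(m, n_estimators)
    print(f"n_estimators={n_estimators:>2}: P(no OOB prediction) = {p:.2e}, "
          f"expected affected samples = {m * p:.2f}")
```

With around ten estimators a handful of samples out of 500 are still expected to lack an OOB prediction, while with 50 the chance is negligible.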

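# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Import accuracy_score (used below to evaluate the predictions)
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc
# (the keyword is `estimator` from scikit-learn 1.2 on; older releases call it `base_estimator`)
bc = BaggingClassifier(estimator=dt, n_estimators=50, random_state=1)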

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

<script.py> output:
    Test set accuracy of bc: 0.71

Out of Bag Evaluation

OOB_score

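# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Import accuracy_score (used below to evaluate the predictions)
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)

# Instantiate bc with out-of-bag scoring enabled
# (the keyword is `estimator` from scikit-learn 1.2 on; older releases call it `base_estimator`)
bc = BaggingClassifier(estimator=dt,
                       n_estimators=50,
                       oob_score=True,
                       random_state=1)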

# Fit bc to the training set 
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))

<script.py> output:
    Test set accuracy: 0.698, OOB accuracy: 0.704
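
The OOB accuracy printed above can be understood through a small from-scratch sketch: every training sample is scored only by the models whose bootstrap sample did not contain it. Everything here is an assumption made for illustration: the 1-nearest-neighbour base learner `fit_1nn`, the toy data, and the use of hard votes (scikit-learn's classifier aggregates predicted probabilities instead, as far as I know).

```python
import random
from collections import Counter

def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on a 1-D feature."""
    return lambda x: min(sample, key=lambda point: abs(point[0] - x))[1]

def oob_accuracy(data, n_estimators, rng):
    """Accuracy measured only with votes from models that never saw the sample."""
    m = len(data)
    votes = [Counter() for _ in range(m)]       # OOB votes per training sample
    for _ in range(n_estimators):
        in_bag = [rng.randrange(m) for _ in range(m)]
        model = fit_1nn([data[i] for i in in_bag])
        for i in set(range(m)) - set(in_bag):   # out-of-bag for this model
            votes[i][model(data[i][0])] += 1
    # Samples that were never out-of-bag have no votes and are skipped
    scored = [(v.most_common(1)[0][0], data[i][1])
              for i, v in enumerate(votes) if v]
    return sum(pred == y for pred, y in scored) / len(scored)

# Toy data (made up): the label is 1 when x > 0.5
rng = random.Random(1)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]
acc = oob_accuracy(data, n_estimators=50, rng=rng)
print(f"OOB accuracy: {acc:.3f}")
```

Because each sample is judged only by models that did not train on it, the estimate behaves like a held-out score without needing a separate validation set, which is consistent with the OOB and test accuracies above being close.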

Random Forests (RF)

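See this article.
Personally, I find it most effective to first understand how each individual classifier works and only then move on to ensemble learning.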
https://www.cnblogs.com/gaowenxingxing/p/12345225.html