Variable Selection: Lasso, Ridge, and ElasticNet

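Constraining or regularizing a model's parameters shrinks some of them toward 0. Shrinkage often improves predictive performance considerably. Ridge regression (called ridge below), lasso, and elastic net are commonly used shrinkage methods that generalize variable selection. Elastic net combines the characteristics of ridge and lasso.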

Lasso vs. Ridge

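Lasso objective function:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

Ridge objective function:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here $\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\bigr)^2$ is the residual sum of squares and $\lambda \ge 0$ is the tuning parameter.
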
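Ridge penalizes the squared L2 norm of the coefficients. Although ridge shrinks the parameter estimates toward 0, no value of the tuning parameter drives any of them to exactly 0. Some estimates may become small enough to ignore, but ridge does not actually perform variable selection. Both L1 and L2 regularization help reduce the risk of overfitting, but the L1 norm brings an additional benefit: it yields "sparse" solutions more readily than the L2 norm, that is, the fitted β has fewer non-zero components.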
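A special case makes the contrast concrete. Assume an orthonormal design (X is the identity matrix, no intercept), a standard textbook simplification that is not part of the discussion above. Under that assumption, and for the penalized objectives RSS + λΣβ_j² and RSS + λΣ|β_j|, both estimators have closed forms: ridge rescales each least-squares coefficient by 1/(1+λ), while lasso soft-thresholds it at λ/2. The function names and the sample coefficients below are illustrative only.

```python
def ridge_shrink(b, lam):
    """Ridge estimate of one coefficient under an orthonormal design (assumption)."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Lasso estimate under the same assumption: soft-thresholding at lam / 2."""
    t = lam / 2.0
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


ols = [3.0, -1.5, 0.4, -0.2]   # hypothetical least-squares coefficients
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)   # every entry is scaled down, none is exactly 0
print("lasso:", lasso)   # entries with |b| <= lam / 2 become exactly 0
```

Ridge scales every coefficient by the same factor, so a non-zero coefficient stays non-zero. Lasso subtracts a fixed amount, so small coefficients are cut to exactly 0.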
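Why does moving from ridge to lasso, that is, from the L2 norm to the L1 norm, let lasso shrink parameter estimates to exactly 0 when ridge cannot? For lasso, the following two problems are equivalent:

$$\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

$$\min_{\beta}\ \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

In other words, for every hyperparameter λ there is a corresponding value of s such that the two problems above yield the same parameter estimates.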
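Similarly, for ridge the following two problems are equivalent:

$$\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\min_{\beta}\ \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$
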
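When the parameter dimension is p = 2, the lasso estimate is the pair (β₁, β₂) that minimizes RSS subject to |β₁| + |β₂| ≤ s, and the ridge estimate minimizes RSS over all values satisfying β₁² + β₂² ≤ s. When s is very large, the constraint is essentially inactive and both lasso and ridge reduce to ordinary least squares. Conversely, when s is small, the admissible parameter values are tightly restricted.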
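The correspondence between λ and s can be checked numerically for ridge. The sketch below uses synthetic data with p = 2 and no intercept (both are assumptions made for illustration). It solves the penalized problem in closed form and reports the budget s = β₁² + β₂² implied by each λ. The helper name ridge_fit is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))            # p = 2 predictors, synthetic data
y = X @ np.array([2.0, 0.5]) + 0.1 * rng.standard_normal(50)


def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (no intercept) in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


ols = ridge_fit(X, y, 0.0)                  # lam = 0 is ordinary least squares
lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
budgets = []
for lam in lams:
    beta = ridge_fit(X, y, lam)
    s = float(beta @ beta)                  # implied budget: beta1^2 + beta2^2
    budgets.append(s)
    print(f"lambda={lam:7.1f}  beta={beta}  s={s:.4f}")
```

Larger λ corresponds to a smaller budget s, and λ = 0 (an unlimited budget) recovers the least-squares fit.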

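The red curves are the contours of the squared-error term RSS. The teal square on the left is the feasible region for (β₁, β₂) under the L1-norm constraint, and the teal circle on the right is the feasible region under the L2-norm constraint. The solution of each of the two constrained problems above is a compromise between the RSS term and the regularization term, that is, it lies where an RSS contour meets a contour of the regularization term. The figure shows that with the L1 norm the two contours often meet on a coordinate axis, so that β₁ or β₂ is 0, whereas with the L2 norm they usually meet inside a quadrant, so that β₁ and β₂ are both non-zero. In other words, the L1 norm yields sparse solutions more readily than the L2 norm.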
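The geometric argument can be reproduced with a small grid search. This is a toy sketch under stated assumptions: the RSS contours are circles around a hypothetical least-squares estimate (2.0, 0.5), the budget is s = 1, and the constrained minimizer is searched only along the boundary of each region, which is valid here because the unconstrained minimum lies outside both regions.

```python
import math

center = (2.0, 0.5)   # hypothetical least-squares estimate, outside both regions
s = 1.0               # constraint budget


def rss(b1, b2):
    """Toy RSS with circular contours around `center` (an assumption)."""
    return (b1 - center[0]) ** 2 + (b2 - center[1]) ** 2


def boundary_point(t, norm):
    """Point at angle t on the boundary of the L1 square or the L2 disc."""
    c, si = math.cos(t), math.sin(t)
    scale = abs(c) + abs(si) if norm == "l1" else 1.0
    return (s * c / scale, s * si / scale)


def constrained_min(norm, n_grid=3600):
    """Grid search along the boundary of the feasible region."""
    points = (boundary_point(2 * math.pi * k / n_grid, norm) for k in range(n_grid))
    return min(points, key=lambda b: rss(*b))


l1_sol = constrained_min("l1")
l2_sol = constrained_min("l2")
print("L1 solution:", l1_sol)   # lands on a corner of the square: beta2 is 0
print("L2 solution:", l2_sol)   # lands inside a quadrant: both are non-zero
```

The corner of the L1 region sits on the β₁ axis, so the second coefficient is exactly 0 there. The L2 region has no corners, so the contact point generally has both coordinates non-zero.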

Elastic Net (ElasticNet)

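Elastic net objective function (in the scikit-learn parameterization, with ρ denoting l1_ratio):

$$\min_{\beta}\ \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \rho \lVert \beta \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert \beta \rVert_2^2$$
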
Elastic net uses both the L1 and the L2 norm as regularization terms. In scikit-learn it is implemented as sklearn.linear_model.ElasticNet.

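The parameter l1_ratio is the share of the penalty assigned to the L1 norm, with 0 <= l1_ratio <= 1. When l1_ratio = 0, elastic net reduces to ridge (only the L2 penalty remains); when l1_ratio = 1, it reduces to lasso.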

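The parameter alpha is the α in the objective above: the larger it is, the more heavily the parameters are penalized and the less prone the model is to overfitting.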
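To see how alpha and l1_ratio interact, the penalty term can be computed by hand. The sketch below assumes the scikit-learn parameterization, alpha * l1_ratio * ||β||₁ + 0.5 * alpha * (1 - l1_ratio) * ||β||₂². The function enet_penalty and the coefficient vector are hypothetical and not part of any library.

```python
def enet_penalty(beta, alpha, l1_ratio):
    """Elastic net penalty in the scikit-learn parameterization (assumption)."""
    l1 = sum(abs(b) for b in beta)          # ||beta||_1
    l2_sq = sum(b * b for b in beta)        # ||beta||_2^2
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq


beta = [1.0, -2.0, 0.0]                     # hypothetical coefficients

lasso_like = enet_penalty(beta, alpha=1.0, l1_ratio=1.0)   # only the L1 term
ridge_like = enet_penalty(beta, alpha=1.0, l1_ratio=0.0)   # only the L2 term
mixed = enet_penalty(beta, alpha=2.0, l1_ratio=0.5)        # both terms

print(lasso_like, ridge_like, mixed)        # → 3.0 2.5 5.5
```

alpha scales the whole penalty, while l1_ratio only decides how that penalty is split between the two norms.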
Usage example:

 import numpy as np  
 from sklearn import linear_model  
 
 ###############################################################################  
 # Generate sample data  
 n_samples_train, n_samples_test, n_features = 75, 150, 500  
 np.random.seed(0)  
 coef = np.random.randn(n_features)  
 coef[50:] = 0.0  # only the first 50 features affect the target
 X = np.random.randn(n_samples_train + n_samples_test, n_features)  
 y = np.dot(X, coef)  
 
 # Split train and test data  
 X_train, X_test = X[:n_samples_train], X[n_samples_train:]  
 y_train, y_test = y[:n_samples_train], y[n_samples_train:]  
 
 ###############################################################################  
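 # Compute train and test scores (R^2 from enet.score, higher is better;
 # the *_errors lists below therefore hold scores, not errors)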
 alphas = np.logspace(-5, 1, 60)  
 enet = linear_model.ElasticNet(l1_ratio=0.7)  
 train_errors = list()  
 test_errors = list()  
 for alpha in alphas:  
     enet.set_params(alpha=alpha)  
     enet.fit(X_train, y_train)  
     train_errors.append(enet.score(X_train, y_train))  
     test_errors.append(enet.score(X_test, y_test))  
 
 i_alpha_optim = np.argmax(test_errors)  
 alpha_optim = alphas[i_alpha_optim]  
 print("Optimal regularization parameter : %s" % alpha_optim)  
 
 # Estimate the coef_ on full data with optimal regularization parameter  
 enet.set_params(alpha=alpha_optim)  
 coef_ = enet.fit(X, y).coef_  
 
 ###############################################################################  
 # Plot results
 
 import matplotlib.pyplot as plt  
 plt.subplot(2, 1, 1)  
 plt.semilogx(alphas, train_errors, label='Train')  
 plt.semilogx(alphas, test_errors, label='Test')  
 plt.vlines(alpha_optim, plt.ylim()[0], np.max(test_errors), color='k',  
         linewidth=3, label='Optimum on test')  
 plt.legend(loc='lower left')  
 plt.ylim([0, 1.2])  
 plt.xlabel('Regularization parameter')  
 plt.ylabel('Performance')  
 
 # Show estimated coef_ vs true coef  
 plt.subplot(2, 1, 2)  
 plt.plot(coef, label='True coef')  
 plt.plot(coef_, label='Estimated coef')  
 plt.legend()  
 plt.subplots_adjust(left=0.09, bottom=0.04, right=0.94, top=0.94,
                     wspace=0.26, hspace=0.26)
 plt.show()
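
The example above picks alpha by its score on the test split, so the reported test score is optimistic. One alternative, offered as a sketch and not as part of the original example, is to let scikit-learn's ElasticNetCV choose alpha by cross-validation on the training data. The data sizes, noise level, and alpha grid below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
n_samples, n_features = 100, 50
coef = np.zeros(n_features)
coef[:5] = rng.randn(5)                     # only the first 5 features matter
X = rng.randn(n_samples, n_features)
y = X @ coef + 0.1 * rng.randn(n_samples)   # small additive noise

# Candidate alphas; 5-fold cross-validation selects one of them.
alphas = np.logspace(-3, 1, 20)
model = ElasticNetCV(l1_ratio=0.7, alphas=alphas, cv=5, max_iter=10000)
model.fit(X, y)

print("alpha chosen by 5-fold CV:", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```

A held-out test set can then be scored once with model.score to get an estimate that was not used for tuning.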

Zhou Zhihua, Machine Learning (机器学习)
http://www4.stat.ncsu.edu/~post/josh/LASSO_Ridge_Elastic_Net_-_Examples.html
http://blog.csdn.net/qq_21904665/article/details/52315642
http://blog.peachdata.org/2017/02/07/Lasso-Ridge.html