scikit-learn (the models used most often in projects): 2.3. Clustering (usable for unsupervised dimensionality reduction of features)

Reference: http://scikit-learn.org/stable/modules/clustering.html

In real projects we very rarely use the simple models such as LR, kNN, and NB. They are classics, but in project work they are honestly not that useful.

Today we will not look at any specific model; instead we focus on unsupervised clustering methods.

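The reason for this focus is practical: in real projects, besides reducing dimensionality with methods such as PCA, we sometimes also use clustering to reduce the dimensionality of the features.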
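As a rough illustration of what "clustering as dimensionality reduction" can look like, here is a minimal sketch. The source does not say which approach is meant, so both options below are assumptions: `FeatureAgglomeration` merges similar features into groups, while `KMeans.fit_transform` re-expresses each sample by its distances to the cluster centers. The dataset (`load_digits`) and the cluster counts (16 and 10) are chosen only for the example.

```python
from sklearn.cluster import FeatureAgglomeration, KMeans
from sklearn.datasets import load_digits

# Small bundled dataset: 1797 samples, 64 pixel features each.
X, _ = load_digits(return_X_y=True)

# Option 1 (assumed): cluster the features themselves.
# Similar features are merged and pooled (mean by default),
# so 64 columns become 16.
agglo = FeatureAgglomeration(n_clusters=16)
X_agglo = agglo.fit_transform(X)

# Option 2 (assumed): cluster the samples, then describe each sample
# by its distance to every cluster center, so 64 columns become 10.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
X_kmeans = kmeans.fit_transform(X)

print(X.shape, X_agglo.shape, X_kmeans.shape)
```

Either reduced matrix can then be fed to a downstream model in place of, or alongside, the original features; which one works better depends on the data and would need to be checked by validation.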

Overview of clustering methods:

A comparison of the clustering algorithms in scikit-learn

Method name	Parameters	Scalability	Use case	Geometry (metric used)
K-Means	number of clusters	Very large `n_samples`, medium `n_clusters` with MiniBatch code	General-purpose, even cluster size, flat geometry, not too many clusters	Distances between points
Affinity propagation	damping, sample preference	Not scalable with `n_samples`	Many clusters, uneven cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Mean-shift	bandwidth	Not scalable with `n_samples`	Many clusters, uneven cluster size, non-flat geometry	Distances between points
Spectral clustering	number of clusters	Medium `n_samples`, small `n_clusters`	Few clusters, even cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering	number of clusters	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints	Distances between points
Agglomerative clustering	number of clusters, linkage type, distance	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints, non-Euclidean distances	Any pairwise distance
DBSCAN	neighborhood size	Very large `n_samples`, medium `n_clusters`	Non-flat geometry, uneven cluster sizes	Distances between nearest points
Gaussian mixtures	many	Not scalable	Flat geometry, good for density estimation	Mahalanobis distances to centers
Birch	branching factor, threshold, optional global clusterer	Large `n_clusters` and `n_samples`	Large dataset, outlier removal, data reduction	Euclidean distance between points
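To make the "flat" versus "non-flat geometry" column more concrete, the following sketch compares K-Means and DBSCAN on the two-moons toy dataset. The dataset, the sample count, and the DBSCAN settings (`eps=0.2`, `min_samples=5`) are assumptions picked for this illustration, not values from the table; on other data they would need tuning.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic non-flat (non-convex) shape.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means assumes roughly convex, evenly sized clusters (flat geometry),
# so it tends to cut each moon in half.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points through chains of close neighbors, so it can
# follow the curved shape. Label -1, if present, marks noise points.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

On this kind of data DBSCAN should score clearly higher than K-Means, which matches the table: K-Means suits flat geometry, while DBSCAN handles non-flat geometry and uneven cluster sizes.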