The pros and cons of hierarchical clustering on large datasets

Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering is of complexity O(n^3), which makes it scale really badly to large datasets.
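To make the O(n^3) cost concrete, here is a minimal sketch (not any particular library's implementation) of naive single-linkage agglomeration. It does n merges, each merge scans all remaining cluster pairs, and each linkage evaluation scans point pairs — the nested scans are exactly where the cubic cost comes from. It also shows the crude "stop at k clusters" way of extracting flat clusters from the merge sequence:

```python
import math

def agglomerative(points, k):
    """Naive single-linkage agglomeration: repeatedly merge the two
    closest clusters until only k remain. Every merge step rescans
    all cluster pairs, which is what makes the overall cost O(n^3)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters (O(n^2) pairs per merge).
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a].extend(clusters.pop(b))
    return clusters

# Two tight pairs of points collapse into two clusters:
groups = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

Stopping at a fixed k is the trivial extraction method; doing anything smarter with the full dendrogram is exactly the part that most implementations leave to you.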

DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already is O(n^2). And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly at the cost of computing each distance twice. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to even reduce that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When the epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2) at the same CPU cost of computing the distance matrix.

Any really good implementation of DBSCAN (it is spelled all uppercase, by the way) will use an index structure to accelerate the neighborhood queries, which brings the average runtime down to around O(n log n).

Original source: https://www.cnblogs.com/harveyaot/p/3408225.html