K-means algorithm: PRML reading notes

The K-means algorithm is based on the use of squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector. Suppose we have a data set {x_1, ..., x_N} consisting of N observations of a random D-dimensional Euclidean variable x. Our goal is to partition the data set into some number K of clusters, where we shall suppose for the moment that the value of K is given.

For each data point x_n we introduce binary indicator variables r_nk ∈ {0, 1}, k = 1, ..., K, where r_nk = 1 if x_n is assigned to cluster k and r_nk = 0 otherwise. We can then define an objective function, sometimes called a distortion measure, given by

    J = Σ_n Σ_k r_nk ||x_n - μ_k||²,   n = 1, ..., N,   k = 1, ..., K.

J represents the sum of the squares of the distances of each data point to its assigned vector μ_k. We can think of the μ_k as representing the centres of the clusters. Our goal is to find values for the {r_nk} and the {μ_k} so as to minimize J.

First we choose some initial values for the μ_k. Then in the first phase we minimize J with respect to the r_nk, keeping the μ_k fixed. In the second phase we minimize J with respect to the μ_k, keeping the r_nk fixed. This two-stage optimization is then repeated until convergence.

In the first phase, we simply assign the n-th data point to the closest cluster centre. This can be expressed as

    r_nk = 1 if k = argmin_j ||x_n - μ_j||², otherwise r_nk = 0.

In the second phase, the objective function J is a quadratic function of μ_k, and it can be minimized by setting its derivative with respect to μ_k to zero, giving

    2 Σ_n r_nk (x_n - μ_k) = 0,

which we can solve for μ_k:

    μ_k = (Σ_n r_nk x_n) / (Σ_n r_nk).

This result has a simple interpretation: the denominator is the number of points assigned to cluster k, so μ_k is set equal to the mean of all of the data points x_n assigned to cluster k. For this reason, the procedure is known as the K-means algorithm.
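
The two-phase procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from PRML: the function names (`assign`, `update`, `distortion`, `kmeans`), the choice of initialising the centres from randomly sampled data points, the `max_iter` and `seed` parameters, and the rule that an empty cluster keeps its previous centre are all assumptions made for the example. Points are represented as tuples of floats, and a list of cluster indices stands in for the indicators r_nk.

```python
import random


def squared_distance(x, mu):
    """Squared Euclidean distance ||x - mu||^2 between two points."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def assign(points, centres):
    """Phase 1: with the centres fixed, pick the nearest centre for each point.

    Returns a list of cluster indices; index k for point n means r_nk = 1.
    """
    return [
        min(range(len(centres)), key=lambda k: squared_distance(x, centres[k]))
        for x in points
    ]


def update(points, labels, centres):
    """Phase 2: with the assignments fixed, move each centre to its cluster mean."""
    new_centres = []
    for k, old in enumerate(centres):
        members = [x for x, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centre.
            new_centres.append(old)
            continue
        dim = len(old)
        new_centres.append(
            tuple(sum(x[d] for x in members) / len(members) for d in range(dim))
        )
    return new_centres


def distortion(points, labels, centres):
    """Objective J: sum of squared distances to the assigned centres."""
    return sum(squared_distance(x, centres[k]) for x, k in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate the two phases until the assignments stop changing.

    Assumes max_iter >= 1 and k <= len(points).
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: initialise from data points
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centres)
        if new_labels == labels:
            break  # assignments unchanged, so the centres would not move either
        labels = new_labels
        centres = update(points, labels, centres)
    return centres, labels, distortion(points, labels, centres)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centres, labels, j = kmeans(data, k=2)
    print(centres, labels, j)
```

Stopping when the assignments no longer change is one simple convergence test; monitoring the decrease in J between iterations would be another reasonable choice. Like K-means in general, the sketch may settle in a local rather than global minimum of J, depending on the initial centres.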