Click on Iterate to begin updating the cluster assignments.
This applet demonstrates a basic implementation of K-means clustering based on Lloyd's algorithm . The algorithm finds a set of means or centroids mk, k = 1,2,...,K such that each data point xi, i = 1,2,...,N can be partitioned based on their assignment to one of these K centroids. The centroids are continuously updated such that the distances between each point and their assigned centroids are minimized.
We begin by randomly assigning K points from the data to serve as our initial centroids. More efficient initializations leading to faster convergences can be found [2,3], but for a simple two-dimensional case we can make do with this basic procedure. Once the K centroids have been initialized, we assign each of the xi to a particular centroid, this assignment indicated by Ci, based on which centroid is closest to xi by the Euclidean distance.
After assignment, we obtain an update of the centroid by averaging across all currently assigned data points:
Then all points are again reassigned to their new clusters based on the updated positions of the centroids. At each step we measure the total (Euclidean) distance between each point and their assigned centroid. To control the range of distances, we obtain their logarithms.
Finally, convergence is achieved once the minimum distance per cluster:
The user may interact with the applet by clicking on the Iterate button, which performs the updating steps of Lloyd's algorithm. Three plots are provided to track the progress of the algorithm towards convergence.
The first plot visualizes the data points across its dimensions (only two are allowed for this applet) and colored based on their cluster assignments. Each cluster's nearby neighborhood is displayed as a circular backdrop whose radius is proportional to its number of members. As the user moves through iterations, these neighborhoods shift in location across the space and so do the data points change cluster assignments. Because of the way we initialize the clusters, it is possible that in the beginning, clusters may overlap or even be completely inscribed within another cluster. This is immediately corrected as soon the algorithm begins updating.
The second plot displays only the centroids as they are updated across iterations. The data points are provided in the backdrop to provide an idea of how the centroids are moving, and the members that are being included and excluded at each iteration.
The third and last plot tracks the improvements in within-group distances as the algorithm updates. Because the objective of the algorithm is to minimize this value to convergence, the algorithm stops and declares convergence when this line flattens out.
A table is also provided listing at each iteration the clusters and their centroid coordinates, along with the number of members, and the average distance of each member to the centroid.