The algorithm starts with an initial set of cluster centers, chosen at random or according to some heuristic procedure. In each iteration, each instance is assigned to its nearest cluster center according to the Euclidean distance between the two. Then the cluster centers are re-calculated. The center of each cluster is calculated as the mean of all the instances belonging to that cluster.
A number of convergence conditions are possible. For example, the search may stop when the partitioning error is not reduced by the relocation of the centers. This indicates that the present partition is locally optimal. Other stopping criteria can be used also such as exceeding a pre-defined number of iterations.
Input: S (instance set), K (number of cluster)
Output: clusters
- : Initialize K cluster centers.
- : while termination condition is not satisfied do
- : Assign instances to the closest cluster center.
- : Update cluster centers based on the assignment.
- : end while
The K-means algorithm may be viewed as a gradient-decent procedure, which begins with an initial set of K cluster-centers and iteratively updates it so as to decrease the error function.
The complexity of T iterations of the K-means algorithm performed on a sample size of m instances, each characterized by N attributes, is: O(T ∗ K ∗ m ∗ N). This linear complexity is one of the reasons for the popularity of the Kmeans algorithms. Even if the number of instances is substantially large, this algorithm is computationally attractive. Thus, the K-means algorithm has an advantage in comparison to other clustering methods (e.g. hierarchical clustering methods), which have non-linear complexity.
Other reasons for the algorithm’s popularity are its ease of interpretation,simplicity of implementation, speed of convergence and adaptability to sparse data. In addition, this algorithm is sensitive to noisy data and outliers (a single outlier can increase the squared error dramatically); it is applicable only when mean is defined (namely, for numeric attributes);and it requires the number of clusters in advance, which is not trivial when no prior knowledge is available. The use of the K-means algorithm is often limited to numeric attributes.
Comments
Post a Comment