What is K Means Clustering?
The K means clustering algorithm divides a set of n observations into k clusters. Use K means clustering when you don’t have existing group labels and want to assign similar data points to the number of groups you specify (K).
In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set. K means clustering is one such algorithm. A cluster is a group or set of similar data points, and I’ll use those terms synonymously.
“K means” refers to the following:
- The number of clusters you specify (K).
- The process of assigning observations to the cluster with the nearest center (mean).
K means clustering forms the groups in a manner that minimizes the variance between the data points and their cluster’s centroid. Learn more about Variances.
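To make that objective concrete, here is a minimal sketch of the quantity K means tries to minimize, the within-cluster sum of squared distances. The points and cluster assignments below are invented purely for illustration:

```python
# Illustrative only: a tiny made-up 2-D dataset with hypothetical cluster assignments.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
labels = [0, 0, 1, 1]  # which cluster each point belongs to

def centroid(cluster_points):
    """Mean of each coordinate across the points in one cluster."""
    n = len(cluster_points)
    return tuple(sum(p[i] for p in cluster_points) / n for i in range(2))

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        c = centroid(members)
        total += sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for p in members)
    return total

print(within_cluster_ss(points, labels))  # → 1.77
```

A different assignment of the same points to clusters would generally give a larger value; K means searches for assignments that drive this total down.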
Imagine you have a large dataset of observations, but there is no grouping information or labels for the data points. Use K means clustering to generate groups comprised of observations with similar characteristics. For example, if you have customer data, you might want to create sets of similar customers and then target each group with different types of marketing.
K means clustering is a popular machine learning algorithm. It’s an unsupervised method because it starts without group labels and then forms and labels the groups itself. Unlike supervised learning methods, it does not attempt to predict existing, known group labels.
K means clustering usage has grown recently thanks to the rise of machine learning, but it began as a statistical grouping method for signal processing in 1957.
Read on to learn how the K means clustering algorithm works and see an example of it.
How the K Means Clustering Algorithm Works
The K means clustering algorithm finds observations in a dataset that are similar to each other and places them in a set. The process starts by randomly assigning each data point to an initial group and calculating the centroid for each one. A centroid is the center of the group. Note that some forms of the procedure allow you to specify the initial sets.
Then the algorithm continues as follows:
- It evaluates each observation, assigning it to the closest cluster. The definition of “closest” is that the Euclidean distance between a data point and a group’s centroid is shorter than the distances to the other centroids.
- When a cluster gains or loses a data point, the K means clustering algorithm recalculates the centroid of each affected cluster.
- The algorithm repeats until it can no longer assign data points to a closer set.
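The loop described above can be sketched in plain Python. This is a minimal illustration using only the standard library, with invented toy data; a real analysis would typically use a library such as scikit-learn:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K means (Lloyd's algorithm) on a list of 2-D points."""
    rng = random.Random(seed)
    # Start with k randomly chosen data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: give each point to the nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # guard against an empty cluster
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Two obvious blobs of made-up data; the algorithm should separate them.
data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.3)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

Note the empty-cluster guard and the convergence check: the loop stops as soon as an assignment pass leaves every point in the same cluster as before.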
When the K means clustering algorithm finishes, all groups have minimal within-cluster variance, which keeps them as compact as possible. Sets with minimal variance contain data points that are as similar to each other as possible. Some variability among the characteristics within each cluster remains, but the algorithm minimizes it.
In short, the observations within a set should share characteristics. Be sure to assess the final groups to confirm that they make sense and satisfy your goals! In some cases, you might need to try different numbers of groups to determine which value of K produces the most useful results. For example, you might find that some sets are too large or too small to be helpful.
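One common way to compare candidate values of K is to compute the total within-cluster variance for each and look for an "elbow" where adding more clusters stops helping much. Here is a toy sketch of that idea: the within-cluster sums of squares below are invented numbers, and the "keep going while the relative drop is large" rule is just one simple heuristic, not a standard function:

```python
# Hypothetical within-cluster sums of squares for K = 1..6 (invented numbers).
wcss = {1: 1200.0, 2: 520.0, 3: 210.0, 4: 180.0, 5: 165.0, 6: 158.0}

def elbow(wcss, threshold=0.25):
    """Pick the last K whose relative improvement over K-1 is still substantial."""
    ks = sorted(wcss)
    chosen = ks[0]
    for k_prev, k_next in zip(ks, ks[1:]):
        # Relative improvement gained by moving from k_prev to k_next clusters.
        gain = (wcss[k_prev] - wcss[k_next]) / wcss[k_prev]
        if gain > threshold:  # arbitrary illustrative threshold
            chosen = k_next
        else:
            break
    return chosen

print(elbow(wcss))  # → 3
```

With these made-up numbers, moving from two to three clusters still cuts the variance sharply, while a fourth cluster adds little, so three is the elbow.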
K Means Clustering Example
Imagine you’re studying businesses in a specific industry and documenting their information. Specifically, you record the variables shown in the dataset snippet below.
Download the full CSV dataset: KMeansClustering.
Now you want to group them into three clusters of similar businesses using these four variables.
Let’s run the analysis!
Interpreting the Results
The downloadable dataset contains the K means clustering assignments for each business. We’ll look at some of the output to understand the groups.
The statistical output shows that K means clustering has created the following three sets with the indicated number of businesses in each:
- Cluster 1: 6
- Cluster 2: 10
- Cluster 3: 6
We know each set contains similar businesses, but how do we characterize them? To do that, we need to look at the Cluster Centroids section. The output shows that Cluster 1 contains businesses with more clients, a higher rate of return, higher sales, and more years in existence. Cluster 3’s centroid has the lowest values, and Cluster 2 falls between them. You can describe the groups as follows:
- 1: Established industry leaders
- 2: Mid-growth businesses
- 3: Newer businesses
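To see how such labels fall out of the centroids, here is a small sketch. The centroid values below are invented stand-ins for illustration, not the actual output from the dataset above:

```python
# Invented centroid values for the four variables (not the article's real output).
centroids = {
    "Cluster 1": {"clients": 950, "rate_of_return": 0.18, "sales": 4.2e6, "years": 22},
    "Cluster 2": {"clients": 430, "rate_of_return": 0.12, "sales": 1.9e6, "years": 11},
    "Cluster 3": {"clients": 120, "rate_of_return": 0.07, "sales": 0.4e6, "years": 3},
}

# Rank the clusters by one variable of interest, e.g. sales, highest first.
by_sales = sorted(centroids, key=lambda name: centroids[name]["sales"], reverse=True)
print(by_sales)  # → ['Cluster 1', 'Cluster 2', 'Cluster 3']
```

Ranking the centroids on each variable in turn makes it easy to spot which cluster leads or trails across the board, which is exactly how the "industry leaders" versus "newer businesses" reading emerges.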
Frequently, examples of K means clustering use two variables that produce two-dimensional groups, which makes graphing easy. This example uses four variables, making the groups four-dimensional. You can’t graph them all on a single plot! This complexity highlights the value of using an algorithm to create the sets. Of course, larger datasets can have many more variables.
I can plot a pair of variables on a scatterplot to help you see the clusters. Keep in mind that K means clustering used the other two variables as well. Consequently, this graph is a two-dimensional slice of a four-dimensional space. Learn more about Scatterplots.
The scatterplot displays companies by their years of existence and sales. The graph color codes the groups in the data.