What is K Means Clustering?
The K means clustering algorithm divides a set of n observations into K clusters. Use K means clustering when you don’t have existing group labels and want to assign similar data points to the number of groups you specify (K).
In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set. K means clustering is one such algorithm. A cluster is a group or set of similar data points, and I’ll use those terms synonymously.
“K means” refers to the following:
- The number of clusters you specify (K).
- The process of assigning observations to the cluster with the nearest center (mean).
K means clustering forms the groups in a manner that minimizes the variance within each cluster, which is based on the squared distances between the data points and their cluster’s centroid. Learn more about Variances.
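For readers who prefer symbols, the quantity being minimized can be written as follows. The notation is an assumption added here for illustration and does not appear elsewhere in this post: C_j is the j-th cluster, mu_j is its centroid, and x_i is an observation.

$$
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

The inner sum is the within-cluster sum of squares for one group, and the outer sum adds those values across all K groups.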
Imagine you have a large dataset of observations, but there is no grouping information or labels for the data points. Use K means clustering to generate groups comprised of observations with similar characteristics. For example, if you have customer data, you might want to create sets of similar customers and then target each group with different types of marketing.
K means clustering is a popular machine learning algorithm. It’s an unsupervised method because it starts without labels and then forms and labels groups itself. K means clustering is not a supervised learning method because it does not attempt to predict existing or known group labels.
K means clustering usage has grown recently thanks to machine learning taking off. But it started as a statistical grouping method for signal processing in 1957.
Read on to learn how the K means clustering algorithm works and see an example of it.
How the K Means Clustering Algorithm Works
The K Means Clustering algorithm finds observations in a dataset that are like each other and places them in a set. The process starts by randomly assigning each data point to an initial group and calculating the centroid for each one. A centroid is the center of the group. Note that some forms of the procedure allow you to specify the initial sets.
Then the algorithm continues as follows:
- It evaluates each observation, assigning it to the closest cluster. The definition of “closest” is that the Euclidean distance between a data point and a group’s centroid is shorter than the distances to the other centroids.
- When a cluster gains or loses a data point, the K means clustering algorithm recalculates its centroid.
- The algorithm repeats until it can no longer assign data points to a closer set.
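The steps above can be sketched in a few lines of Python. Treat this as a minimal illustration rather than the procedure any particular statistical package uses. The function names, the `seed` and `max_iter` parameters, and the rule for empty clusters are all assumptions made for this sketch. It also recalculates the centroids once per pass through the data instead of after every single reassignment, which is a common simplification.

```python
import math
import random


def centroid(points):
    """Mean of each coordinate across a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random initial partition: shuffle the indices and deal them round-robin
    # so every cluster starts with at least one point (needs len(points) >= k).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centroids = []
    for _ in range(max_iter):
        # Update step: recompute each centroid from its current members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: a cluster that loses all points keeps its old centroid.
            new_centroids.append(centroid(members) if members else centroids[c])
        centroids = new_centroids

        # Assignment step: move each point to the nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point moved, so stop
            break
        labels = new_labels
    return labels, centroids


# Made-up two-dimensional points forming two obvious groups.
example = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
example_labels, example_centroids = k_means(example, k=2)
print(example_labels)
print(example_centroids)
```

Which group ends up numbered 0 or 1 depends on the random starting assignment, so the label values themselves carry no meaning. Only the grouping does.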
When the K means clustering algorithm finishes, it has reduced the within-cluster variance as far as it can from its starting point, which keeps the groups as compact as possible. Because the result can depend on the initial assignments, the final solution is not guaranteed to be the best one possible. Sets with low variance have data points that are as similar as possible. There is variability amongst the characteristics in each cluster, but the algorithm minimizes it.
In short, the observations within a set should share characteristics. Always assess the final groups to confirm they make sense and satisfy your goals! In some cases, you might need to try different numbers of groups to determine which value of K produces the most useful results. For example, you might find that some sets are too large or too small to be helpful.
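One rough way to compare candidate solutions is to look at each group’s size and its within-cluster sum of squares. The sketch below assumes you already have cluster labels for each observation. The function name `summarize_clusters`, the data, and the hand-written labels are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def summarize_clusters(points, labels):
    """Return {label: (size, within_cluster_sum_of_squares)}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    summary = {}
    for label, members in groups.items():
        dims = range(len(members[0]))
        # Centroid: the mean of each coordinate across the group's members.
        center = [sum(p[d] for p in members) / len(members) for d in dims]
        # Sum of squared distances from each member to the centroid.
        wcss = sum((p[d] - center[d]) ** 2 for p in members for d in dims)
        summary[label] = (len(members), wcss)
    return summary


# Hypothetical data and hand-written labels, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (20, 20)]
two_groups = [0, 0, 0, 1, 1, 1]
three_groups = [0, 0, 0, 1, 1, 2]

for name, labels in [("K=2", two_groups), ("K=3", three_groups)]:
    summary = summarize_clusters(points, labels)
    total = sum(wcss for _, wcss in summary.values())
    print(name, summary, "total:", round(total, 2))
```

The total tends to fall as K grows, so it works best as one input alongside the group sizes and your subject-area knowledge, not as the sole criterion.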
K Means Clustering Example
Imagine you’re studying businesses in a specific industry and documenting their information. Specifically, you record each business’s number of clients, rate of return, sales, and years in existence.
Download the full CSV dataset: KMeansClustering.
Now you want to group them into three clusters of similar businesses using these four variables.
Let’s run the analysis!
Interpreting the Results
The downloadable dataset contains the K means clustering assignments for each business. We’ll look at some of the output to understand the groups.
The statistical output shows that K means clustering has created the following three sets with the indicated number of businesses in each:
- Cluster 1: 6
- Cluster 2: 10
- Cluster 3: 6
We know each set contains similar businesses, but how do we characterize them? To do that, we need to look at the Cluster Centroids section. The output shows that Cluster 1 contains businesses with more clients, a higher rate of return, higher sales, and more years in existence. Cluster 3’s centroid has the lowest values. Cluster 2 is between them. You can describe the groups as the following:
- 1: Established industry leaders
- 2: Mid-growth businesses
- 3: Newer businesses
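If your software doesn’t report the centroids, you can recover them from the cluster assignments by averaging each variable within each group. The sketch below shows the idea. The rows are invented and are NOT values from the downloadable dataset, and the column names and the function name `cluster_centroids` are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical column names; the real CSV may label its columns differently.
VARIABLES = ("clients", "rate_of_return", "sales", "years")

# Invented rows for illustration only, not taken from the article's dataset.
businesses = [
    {"cluster": 1, "clients": 900, "rate_of_return": 12.0, "sales": 5000, "years": 20},
    {"cluster": 1, "clients": 800, "rate_of_return": 11.0, "sales": 4500, "years": 18},
    {"cluster": 2, "clients": 500, "rate_of_return": 8.0, "sales": 2500, "years": 10},
    {"cluster": 2, "clients": 450, "rate_of_return": 7.0, "sales": 2300, "years": 9},
    {"cluster": 3, "clients": 150, "rate_of_return": 4.0, "sales": 800, "years": 3},
    {"cluster": 3, "clients": 100, "rate_of_return": 3.0, "sales": 600, "years": 2},
]


def cluster_centroids(rows, variables=VARIABLES):
    """Average each variable within each cluster to get that cluster's centroid."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)
    return {
        cluster: {v: mean(r[v] for r in members) for v in variables}
        for cluster, members in sorted(grouped.items())
    }


for cluster, center in cluster_centroids(businesses).items():
    print(cluster, center)
```

Comparing the centroids variable by variable is what lets you attach descriptive names to the groups.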
Frequently, examples of K means clustering use two variables that produce two-dimensional groups, which makes graphing easy. This example uses four variables, making the groups four-dimensional. You can’t graph them all on a single plot! This complexity highlights the value of using an algorithm to create the sets. Of course, larger datasets can have many more variables.
I can plot a pair of variables on a scatterplot to help you see the clusters. Keep in mind that K means clustering used the other two variables as well. Consequently, this graph is a two-dimensional slice of a four-dimensional space. Learn more about Scatterplots.
The scatterplot displays companies by their years of existence and sales. The graph color codes the groups in the data.
Ahmed says
Hi Prof and thank you !
Can p values be used to determine the significance of the number of similar images being clustered together as thumbnails on a canvas vs. the number of outliers on a canvas? In other words, the ratio of similar thumbnails to thumbnail outliers on a canvas.
The traditional k means performance tests for inter cluster and intra cluster are computationally too expensive for this large dataset of visual images. I chose the quality of the visual clustering output for my use case.
Warm regards
Jeremy says
Can k-means be used with categorical data?
Jim Frost says
Hi Jeremy,
No, you can’t use K means clustering with categorical data. K means minimizes distances between data points and centroids. Categorical data cannot be placed on a scale with distances between observations. Hence, it doesn’t work with that data type. In terms of nominal (categorical), ordinal, interval, and ratio scales, you really need at least an interval scale (for the distances) to be able to use K means clustering.
José Francisco dos Reis Neto says
Thanks for your post. I’ve been following you from here in Brazil, and I have your book. K means is important for analyzing the mass of data we manipulate.
Jim Frost says
Hi José! Greetings to you in lovely Brazil! Thanks so much for buying one of my books! 🙂
Thanks for writing. I think you’re correct that K means clustering has grown in importance due to the volume of data we now produce.
Mer says
Hello. Thank you for sharing your experience. Could you share some titles of good books on the analysis and design of surveys where those methods that you mentioned are also available?
Neil Higgs says
In the survey research world, K-means has a very chequered reputation amongst those who have used it extensively. It is exceptionally sensitive to changes in input variables and to the choice of the number of clusters. The first is a particular problem unless there is an underlying theory about the relevant variables. Adding one variable can change the entire solution dramatically. I stopped using it years ago in favour of latent class analysis and latent continua, especially via the use of correspondence analysis. K-means is also not great with non-continuous variables.
Jim Frost says
Hi Neil,
Thanks for sharing your experiences with K means clustering. It’s great when readers write about their real-world usage of statistical procedures. I think you raise some great points in your comments. First of all, not all tools work in all contexts. Secondly, it’s important to assess the results and determine whether they make sense given your subject-area knowledge. Don’t treat any procedure as a black box that just spits out an answer that you assume is correct. Verify!
Think of it this way. Classifying any set of observations using a set of variables can be tricky even when you have experts in the subject area doing the classification. There can be disagreements between experts on how to weight the different variables and where to draw lines between different categories. Now imagine doing that with a statistical algorithm that doesn’t use any subject-area knowledge but does assess the data using an objective approach that considers a large number of variables. There are bound to be pros and cons. Various difficulties. And so on. There is also the question of how many groups are optimal.
In short, I’m sure how well it works varies by area.
I think K means clustering has its place, particularly when you think of the big data arena and cases where the risk of misclassification is small. Maybe you classify people into different groups and then, based on the groups, target them with different types of advertising or product recommendations.
Funsho Olukade says
Thank you Prof. Clearly K means clustering as a statistical tool can help create meaning out of a cacophony of data just as your given example clearly demonstrated it.
Jim Frost says
Hi Funsho, thanks! And I love how you phrase it, “creating meaning out of a cacophony of data!” That’s exactly what it does!