Cluster analysis is a powerful analytical technique for grouping or segmenting similar elements, e.g. customers, products, etc. In this series of articles, you will explore the nuances of cluster analysis and its applications. Analytics challenges on YOU CANalytics are designed like puzzles where your participation is extremely important to move things forward. Hence, please share your thoughts and answers in the discussion section at the bottom.
Cluster analysis has several business applications where it plays a pivotal role:
– Lifestyle or psychographic segmentation for marketing: grouping customers into clusters based on their interests and belief systems. This, in turn, helps marketers offer the right product to the right customer.
– Product grouping: grouping products into relevant categories based on product attributes, for instance, clubbing movies into genres such as action, rom-com, and horror.
– News/content categorization: identifying categories for media content based on text mining and organizing content accordingly. Google does this extensively to show you relevant content and news.
Cluster analysis is an unsupervised analytical technique. This does not mean it runs on its own without any supervision. On the contrary, it requires the analyst to have an extremely good understanding of the business context and problem, which is essential for choosing the right set of input variables for segmentation. This demands a great degree of creativity and cognizance from the analyst. Let us explore how the initial choice of input variables can be critical for cluster analysis by creating a link between…
Twins Paradox and Cluster Analysis
The quintessential Bollywood script of the 1970s and 80s involved identical twins separated at birth and reunited as adults. These twins would get completely different upbringings, and there was enough drama before the reunion to entertain the audience without fail. The lost-and-found siblings thus became the most trusted formula in the history of Bollywood cinema. By the way, identical twins are nature's closest example of genetic similarity. While Bollywood was delivering hit after hit using the formula, a researcher on the other side of the globe was studying several identical twins separated at birth.
In his study, Thomas Bouchard Jr. of the University of Minnesota analyzed identical twins adopted by different families. The question was to identify the roles of nature (genetics) and nurture (upbringing) in shaping personality. Most of these twins did not meet each other again, after their separation at birth, until they were fully grown adults.
Results from the Twins Study
There are several interesting stories and patterns in this research. In one instance, twin brothers James Lewis and James Springer were living almost identical lives, oblivious to each other's existence. Both of them married and divorced first wives named Linda and then married second wives named Betty. They both named their childhood pets Toy. Guess what the names of their firstborn sons were? James Alan Lewis and James Allan Springer. In another instance, twins Oskar and Jack were raised by families as different as chalk and cheese: Oskar was raised as a Nazi youth in Germany and Jack as a Jewish boy in the Caribbean. Despite this, on the day of their reunion as adults they unknowingly showed up wearing identical clothes. Apparently, the brothers had independently developed a similar taste in fashion.
These stories, however, do not prove that an identical twin is always a mirror image of her co-twin in terms of personality. There are enough instances where twins have completely different personalities from each other. Thomas Bouchard was interested in understanding the role of proximity, or upbringing, in twins developing a similar personality. In a much wider analysis of a large number of twins, he found completely counter-intuitive results: proximity, or upbringing, plays virtually no role in the development of personality or interests. A twin has the same probability of having a similar personality to his co-twin irrespective of whether they were brought up together or apart. The most significant factor in the development of personality, according to the twin study, is nature (genes), not nurture (upbringing).
How is this Related to Cluster Analysis?
So, how is this related to the choice of variables in cluster analysis? At this point, I must remind you that cluster analysis, unlike supervised machine learning methods, does not identify a small set of significant variables from a large list of input variables. Instead, it segments the population based on all the input variables. This is why analysts need to be careful to choose the appropriate input variables for the problem at hand. Let me try to explain this using the results from the twin study.
Imagine, based on the twin study, that we have 200 pairs of twins: 100 pairs brought up together and 100 brought up by different adoptive parents. Also assume that 30% of the pairs in either group share a similar personality. Now, if you cluster these pairs of twins based on either proximity variables (i.e. shared houses, schools, parenting, etc.) or personality variables (i.e. shared interests, attitudes, beliefs), you will get completely different sets of clusters. Neither set of clusters is wrong; each is appropriate for a different problem. Cluster analysis is extensively used for customer segmentation and profiling, and customer segmentation is not very different from clustering the twins in our example. The choice of input variables therefore determines the customer segmentation, just as the choice of either proximity or personality variables determines how the twins are grouped.
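To make this concrete, here is a minimal R sketch with entirely made-up 'proximity' and 'personality' variables (hypothetical data, not from the actual twin study). Clustering the same 200 twin pairs on the two variable sets produces two segmentations that cut the population very differently:

## Hypothetical twin pairs: 100 raised together, 100 apart;
## 30% of each group shares a similar personality
set.seed(1)
sim_pers <- function() c(rnorm(30, 1, 0.2), rnorm(70, 0, 0.2),  # 30% of "together" group
                         rnorm(30, 1, 0.2), rnorm(70, 0, 0.2))  # 30% of "apart" group
twins <- data.frame(
  shared_home     = c(rnorm(100, 1, 0.2), rnorm(100, 0, 0.2)),  # proximity
  shared_school   = c(rnorm(100, 1, 0.2), rnorm(100, 0, 0.2)),  # proximity
  shared_interest = sim_pers(),                                 # personality
  shared_belief   = sim_pers()                                  # personality
)
prox_seg <- kmeans(twins[, c("shared_home", "shared_school")], 2)$cluster
pers_seg <- kmeans(twins[, c("shared_interest", "shared_belief")], 2)$cluster
table(prox_seg, pers_seg)  ## the two segmentations disagree on most pairs

Both segmentations are valid; which one you want depends entirely on the business question, which is exactly the point about input variables.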
Now, let's dive into the cluster analysis puzzle:
Analytics Challenge – Cluster Analysis
In this analytics challenge, we will use the k-means clustering algorithm. We have discussed the k-means algorithm in detail in this cluster analysis case study for telecom. You may want to read that case study to brush up on your k-means concepts.
First things first: for this cluster analysis puzzle we will simulate some data with 2 input variables (x and y). We will use R for all our calculations.
set.seed(57)
x = c(rnorm(30,0,1), rnorm(30,10,1), rnorm(30,20,1))  # 30 points around each of 3 centers
y = c(rnorm(30,0,1), rnorm(30,10,1), rnorm(30,20,1))
a = as.data.frame(cbind(x,y))  # combine into a data frame
plot(a)
rm(x,y)
This dataset, as you can see in the plot, has three well-separated clusters.
Now, let's see how the k-means algorithm performs on this data. We will set the number of clusters to 3, i.e. k=3. For this problem, the choice is easy since we can clearly see 3 distinct clusters. However, don't expect the choice of k to be so simple for most datasets.
set.seed(42)  # seed controlling the random initial centers
kmean = kmeans(a, 3)
plot(a, col=kmean$cluster, pch=16)
legend(-3, 23, c('cluster 1','cluster 2','cluster 3'), pch=16, col=c("black","green","red"))
Not bad: the k-means algorithm has clustered the data in the expected manner. Now, let's run the same algorithm with a different choice of initial seed.
set.seed(57)  # a different seed for the initial centers
kmean1 = kmeans(a, 3)
plot(a, col=kmean1$cluster, pch=16)
legend(-3, 23, c('cluster 1','cluster 2','cluster 3'), pch=16, col=c("black","green","red"))
Ouch! This doesn't look right; now we have got completely different clusters. Here are a few questions for discussion. Post your opinions/answers in the discussion section.
- What has gone wrong in the second analysis?
- Suggest a few strategies that will help to avoid such spurious and contradictory results.
- Choosing an appropriate value of k, the number of clusters, is essential to get the right results from cluster analysis. How do you choose the value of k in k-means clustering?
Sign-off Note
Don’t treat these questions like an exam but be creative and imaginative while answering them. Remember, there are no wrong answers here. And who knows, we may find some completely novel ways to do cluster analysis through our discussion.
* What has gone wrong in the second analysis?
From what I understand, since the data already has a clear pattern, the seed usually does not affect how the kmeans function groups the points. Most seed initializations do not stop the kmeans function from finding the most optimal centers. But when the seed is reinitialized, the kmeans function can be pushed from 'the most optimal' centers to 'the next most optimal' centers for grouping them.
The reason for the difference between the first and second runs is the random seed: the seed randomly picks the initial central point of each cluster, which affects the final clustering result.
One way to avoid the effect of randomness is to pick initial central points that are as far away from each other as possible.
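That suggestion can be sketched directly in R with a farthest-point initialization (similar in spirit to the k-means++ idea): pick the first center at random, then repeatedly pick the point farthest from the centers chosen so far, and pass those centers to kmeans(). A minimal sketch on the puzzle dataset a:

## Sketch: choose initial centers as far apart as possible
far_centers <- function(data, k) {
  m <- as.matrix(data)
  centers <- m[sample(nrow(m), 1), , drop = FALSE]  # first center at random
  while (nrow(centers) < k) {
    ## squared distance from each point to its nearest chosen center
    d <- apply(m, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, m[which.max(d), , drop = FALSE])  # take the farthest point
  }
  centers
}
kmean2 <- kmeans(a, centers = far_centers(a, 3))  # stable clusters regardless of seed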
I couldn't figure out the significance of the seed. I tried seed values from 41 to 44 and they worked well; beyond these values, however, it did not work well. I have used the following code to figure out the optimum number of clusters.
library(NbClust)
## it suggests the optimum number of clusters
NbClust(a, method = 'complete', index = 'all')$Best.nc[1,]  ## it says the optimum no. of clusters is 3
## Now apply the k means clustering.
cluster <- kmeans(a, 3)  ## using the optimum cluster count of 3
cluster
plot(a, col=cluster$cluster,pch=16)
legend(-3,23,c('cluster 1','cluster 2','cluster 3'),pch= 16,col=c("black","green","red"))
*** 1) What has gone wrong in the second analysis? ***
Nothing has gone wrong. Since the k-means algorithm's results derive from an iterative process, the random seed in the second experiment made the method converge to a different result.
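In base R, the usual guard against this behavior is the nstart argument of kmeans(), which runs several random initializations and keeps the solution with the lowest total within-cluster sum of squares. A quick sketch on the puzzle data:

set.seed(57)  ## the "bad" seed from the puzzle
kmean_multi <- kmeans(a, 3, nstart = 25)  ## 25 random starts, keep the best
kmean_multi$tot.withinss  ## total within-cluster SS of the winning run
plot(a, col = kmean_multi$cluster, pch = 16)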
*** 3) Choosing an appropriate value of k or number of clusters is essential to get the right results from cluster analysis. How do you choose the value of k in k-mean clustering? ***
In order to decide the optimal number of clusters k, I would check whether the groups are really different based on univariate and bivariate analysis, and also whether the groups respect business understanding.
1. set.seed(57) was used to generate the data. In the first run a different integer (42) was used to seed the initial cluster centers, which likely chose initial centers that are far from each other, and the algorithm converged well within the maximum iterations (10). In the second case, set.seed(57) was used again, which caused it to choose center points from the bottom-left cluster. The algorithm did not converge well even after 10 iterations, hence the bad clusters.
2. Selection of k is itself a research topic. I would check how each attribute is distributed (univariate analysis) and choose a number.
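Another common heuristic, alongside the NbClust approach shown earlier, is the elbow method: plot the total within-cluster sum of squares against k and look for the bend. A minimal sketch on the puzzle data a:

## Elbow method: total within-cluster SS for k = 1..10
wss <- sapply(1:10, function(k) kmeans(a, k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
## the curve flattens sharply after k = 3, suggesting three clusters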
You could run k-means a number of times with different seeds. With such clearly delineated groups, most random seeds will produce the 'correct' clustering, and you can then throw away the minority reports. If a significant proportion of the seeds do not generate the same result, then either you have the wrong value of k or there is no pattern.
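A rough sketch of that idea in R: since cluster labels are permuted arbitrarily across runs, the solutions are easier to compare by their total within-cluster sum of squares than by raw labels:

## Run k-means under 50 different seeds and tally the solutions found
runs <- sapply(1:50, function(s) {
  set.seed(s)
  round(kmeans(a, 3)$tot.withinss, 2)
})
table(runs)  ## the dominant value marks the consensus clustering; rare values are the "minority reports"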
Agree with Aaron Reese. From my reading, one would not perform this type of analysis once but replicate it many times to determine an appropriate number of clusters.