Outliers
“I refuse to join any club that would have me as a member.” – Groucho Marx
This witty statement came from (in my opinion) one of the funniest men in the history of American cinema: Julius Henry Marx, better known as Groucho Marx. Groucho was certainly a very unusual man and might well be considered an outlier. Today we are going to discuss the impact of outliers on cluster analysis and on life in general. An outlier is an observation that is distant or different from the others. I know statisticians have nightmares about outliers. A single outlier can wreak havoc in any analysis, hence the general tendency is to drop them from the analysis or beat them back to normal (read: transform the data so that it looks more like a normal distribution). At times these techniques are necessary for the sake of the analysis. However, analysts, scientists, and society ignore outliers altogether at their own peril, because an outlier could be pointing towards a new trend emerging in the system. Today’s outlier may very well be tomorrow’s normal. Also, let’s face it, outliers are so much more fun!
I have discussed a few outliers in the articles on YOU CANalytics, such as Columbus, Turing, Euler, Sherlock Holmes, Leonardo da Vinci, Batman, and of course Groucho Marx. These men* have changed the course of human history in their own way. I hope to introduce more such outliers in future articles. Interestingly, one of the striking features of human outliers is the treatment they receive from society, which is similar to that of statistical outliers: they get ignored or beaten up until they convert to normal.
* I just noticed that so far I have not introduced a woman outlier in my articles; I will do so soon.
Telecom Case Study Example and Outliers
In the last few articles, we have been working on a case study example from the telecom sector where you are playing the role of the head of customer insights and marketing (Read Part 1 and Part 2). In those articles, you started with some fundamentals of cluster analysis. The business case was to create customer segments to understand your customer base better and enhance your company’s marketing campaigns. In the first part of this case study, we created clusters on our dataset using the following two variables: average international and average local call duration. We chose a couple of random seeds and produced the clusters shown in the adjacent plot. These clusters were produced through repeated iterations in which Euclidean distance plays a crucial role: each data point is assigned to its nearest centroid, and each centroid is then moved to the mean of the points assigned to it.
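To make the mechanics concrete, here is a minimal sketch of that iterative procedure in plain Python. The customer values, the two seeds, and the function names (`nearest`, `recompute`, `kmeans`) are all invented for illustration; they are not the case-study data or the tool used in the earlier articles.

```python
import math

# Hypothetical customers as (avg local minutes, avg international minutes).
# These values are invented for illustration; they are not the case-study data.
customers = [
    (1.0, 3.0), (1.5, 3.5), (2.0, 3.0), (1.5, 4.0),  # shorter local, longer international
    (3.0, 1.0), (3.5, 1.5), (4.0, 1.0), (3.5, 2.0),  # longer local, shorter international
]

def nearest(point, centroids):
    """Index of the centroid closest to `point` by Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def recompute(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it in place if it has none."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)
    return moved

def kmeans(points, seeds, max_iter=100):
    """Alternate assignment and centroid updates until the labels stop changing."""
    centroids, labels = list(seeds), None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = recompute(points, labels, centroids)
    return centroids, labels

seeds = [(1.0, 3.0), (4.0, 1.0)]  # two hand-picked starting centroids (an assumption)
centroids, labels = kmeans(customers, seeds)
print(labels)
print(centroids)
```

On these made-up numbers the procedure settles after one update, with the first four customers in one cluster and the last four in the other.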
Now, let us add one more customer (data point) to the above dataset, with an average local call duration of 20 minutes and an average international call duration of 10 minutes. This customer is clearly an outlier, as can be seen in the adjacent plot. Let us perform cluster analysis on this modified dataset with the same random seeds that we used in the first article. The iterative results are shown in the animation below. There are 4 iterations in total; notice how the cluster centroids move with each iteration. Also keep an eye on how the cluster allegiance of each data point changes across iterations. The color of a data point represents its cluster allegiance.
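The same experiment can be traced in code. The sketch below is only an approximation of the animation: the eight regular customers and the two seeds are invented for illustration, and only the outlier at (20, 10) comes from the case study. The number of iterations therefore need not match the four seen with the real data.

```python
import math

# Hypothetical customers as (avg local minutes, avg international minutes);
# invented for illustration, not the case-study data.
customers = [
    (1.0, 3.0), (1.5, 3.5), (2.0, 3.0), (1.5, 4.0),
    (3.0, 1.0), (3.5, 1.5), (4.0, 1.0), (3.5, 2.0),
]
outlier = (20.0, 10.0)                # local 20 min, international 10 min (from the text)
points = customers + [outlier]
centroids = [(1.0, 3.0), (4.0, 1.0)]  # assumed seeds, standing in for the original ones

def nearest(point, centers):
    """Index of the center closest to `point` by Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

labels, history = None, []
for _ in range(100):                  # safety cap; the loop normally exits via `break`
    new_labels = [nearest(p, centroids) for p in points]
    if new_labels == labels:          # allegiances stopped changing: converged
        break
    labels = new_labels
    for k in range(len(centroids)):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:                   # an empty cluster keeps its old centroid
            centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    history.append((labels, list(centroids)))
    print(f"iteration {len(history)}: labels={labels} centroids={centroids}")
```

With these invented numbers, the outlier first joins the second cluster and drags its centroid far away from the regular customers. On the next pass every regular customer is closer to the first centroid, so the outlier is left alone in its own cluster.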
Clearly, the presence of an outlier has changed the entire result of our analysis. You must have noticed that the initial choice of two clusters, or two random seeds, was not a good one once the outlier was added. In this case, the outlier became a cluster of its own and the remaining data points were merged into another cluster. If we had started with 3 cluster centroid seeds, we would have seen more reasonable clusters. This raises an important question about how to choose the number of clusters at the beginning of the analysis. Though one would like a simple answer to this question, trust me, there isn’t one. There are a few analytical techniques that can help here, but at the end of the day the analyst needs to make a prudent choice based on her domain experience. We will discuss these analytical techniques in some other article.
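As a rough illustration of the difference between two and three seeds, the sketch below runs the same procedure twice on a made-up dataset. Everything except the outlier at (20, 10) is an assumption: the eight regular customers, the seeds, and in particular the third seed at (10, 6), which is hand-picked here rather than chosen by any principled method.

```python
import math
from collections import Counter

# Hypothetical (avg local minutes, avg international minutes) values;
# only the last point, the outlier, is taken from the article.
points = [
    (1.0, 3.0), (1.5, 3.5), (2.0, 3.0), (1.5, 4.0),
    (3.0, 1.0), (3.5, 1.5), (4.0, 1.0), (3.5, 2.0),
    (20.0, 10.0),
]

def kmeans_labels(points, seeds, max_iter=100):
    """Return the final cluster label of each point, starting from the given seeds."""
    centroids, labels = list(seeds), None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:      # no allegiance changed: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:               # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

two_seeds = [(1.0, 3.0), (4.0, 1.0)]      # assumed seeds
three_seeds = two_seeds + [(10.0, 6.0)]   # hypothetical third seed

labels_k2 = kmeans_labels(points, two_seeds)
labels_k3 = kmeans_labels(points, three_seeds)
print("k=2 cluster sizes:", sorted(Counter(labels_k2).values()))
print("k=3 cluster sizes:", sorted(Counter(labels_k3).values()))
```

On this toy data, two seeds lump all eight regular customers together and isolate the outlier, while three seeds recover the two regular groups and still keep the outlier apart. A different third seed, or different data, could of course give a different picture.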
Sign-off Note
I believe there is an outlier hidden somewhere in all of us. Outliers can make the world a better place to live. We need to make sure these outliers are not beaten into normal by the pressures of life we all go through. Let me finish this article with another classic statement from Groucho Marx, who didn’t lose his witty self even on his deathbed. A nurse came in and told the frail Groucho that she wanted to check whether he had a temperature. Groucho retorted in his quick-witted tone, “Don’t be silly – everybody has a temperature”. Yes, this is similar to saying everybody has an outlier.
Hi Roopam,
Thanks for this exciting, educational blog! It is very useful for newbies and a good refresher for experienced people. It is exhilarating to see how you relate real-world subjects to business solutions: here, the universe and galaxies to clustering for customer segmentation.
I noticed that the X and Y axes on the graphs do not seem to be labeled correctly. The X axis carries the data for average local calls and the Y axis carries international calls, but the labels on the images are the other way around. Please check whether this is intended.