Galaxies and Cluster Analysis
I live in Mumbai (Bombay), the financial capital of India and one of the largest cities in the world. One of the problems of living in a large city is that you rarely see stars in the night sky. The little sky one can see between the skyscrapers is smeared with light pollution, and it is difficult to spot any stars at all. One of the best night skies I have ever seen was at Saint George Island on the Gulf of Mexico, Florida. On a pitch-dark night during the Floridian winter, the gorgeous night sky seemed to hold more than a million stars. It is a wonderful sight! My fascination with the sky and stars is a possible reason for my fascination with physics. As I have mentioned earlier, I did my master's in physics and am ever curious about astrophysics and the origin of the universe. Let us try to understand the enormousness of the universe, of which we can see only a fraction in the night sky.
Our planet, the earth, may seem like everything to us. However, we know it is just one of the eight planets (nine, before Pluto was reclassified) revolving around the sun. The sun is just one among around 200 billion stars in the Milky Way galaxy, the place where the sun and the earth reside. This is already enormous, but to make it unfathomable, the universe has more than 200 billion galaxies. Using this, one could approximate the number of stars in the universe as ~4 × 10²² (from 200 billion × 200 billion; obviously these numbers are gross approximations). I am happy we can see so many stars in a clear night sky, even if it is just a tiny fraction of the actual number of stars. Now, we have the following two questions to answer:
1) What are galaxies?
2) What is the relationship between galaxies and the title of this post (cluster analysis / customer segmentation)?
Galaxies are clusters of stars, gas, dust, planets and interstellar clouds. Usually, galaxies are spiral or elliptical in shape (shown in the picture from Wikipedia). Each galaxy is separated from its neighboring galaxies in three-dimensional space. Enormous black holes are believed to sit at the center of most large galaxies. For this post, picture each black hole as the anchor point around which its galaxy is arranged (strictly speaking, a galaxy is held together by the combined gravity of all its matter, not by the central black hole alone).
As we discuss cluster analysis in the next section, you will find striking similarities between galaxies and clusters. Just as galaxies form in three-dimensional space, cluster analysis is a multivariate analysis performed in n-dimensional space.
Note – keep the concept of black holes at the center of the galaxies in mind. We will use a similar concept of centroid for cluster analysis really soon.
Cluster Analysis – A Telecom Case Study
You are head of customer insights and marketing at a telecom company, ConnectFast Inc. You realize that not every customer is similar and you need to have different strategies to attract different customers. You appreciate the power of customer segmentation to deliver superior results with optimized cost. You are also aware of unsupervised learning techniques such as cluster analysis to create customer segments. To brush up your skills with cluster analysis, you have selected a sample of eight customers with their average call duration (both locally and internationally). The following is the data:
To get a feel for this, you have plotted the data with average international call duration on the x axis and average local call duration on the y axis. The following is the plot:
Note – this is similar to the cluster of stars in the night sky (here, stars are replaced with customers). Additionally, instead of a three-dimensional space we have a two-dimensional plane with average international and local call duration on the x-axis and y-axis respectively. Now, as with galaxies, the task is to find the location of the black holes; in cluster analysis they are called centroids. To locate the centroids, we start by placing them at random points.
Euclidean Distance to find Cluster Centroids
In this case, two centroids (C1 & C2) are randomly placed at the coordinates (1, 1) and (3, 4). Why did we choose two centroids? For this problem, a visual inspection of the scatter plot above tells us that there are two clusters. However, as we will see in a later part of this series, this question may not have such a straightforward answer for larger data sets.
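Now, we will measure the distance between each of the two centroids (C1 & C2) and every data point on the above scatter plot using the Euclidean measure. For a data point $(x_1, y_1)$ and a centroid $(x_2, y_2)$, the Euclidean distance is given by the following formula:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$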
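Columns 3 and 4 (i.e. distance from C1 and C2) are calculated using the same formula. For instance, for the first customer, at (2, 2), the distance from C1 at (1, 1) is $\sqrt{(2-1)^2 + (2-1)^2} \approx 1.41$ and the distance from C2 at (3, 4) is $\sqrt{(2-3)^2 + (2-4)^2} \approx 2.24$.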
You could calculate all the other values similarly. Additionally, cluster membership (last column) is assigned based on closeness to the centroids (C1 and C2). The first customer is closer to centroid C1 (1.41 in comparison to 2.24) and hence is assigned membership C1.
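The distance and membership check for the first customer can be sketched in a few lines of Python. This is only an illustration: the coordinates (2, 2) for the first customer are inferred from the distances quoted above (1.41 and 2.24) rather than read from the data table, and the variable names are my own.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Initial, randomly placed centroids from the text
c1, c2 = (1, 1), (3, 4)

# Assumption: first customer's coordinates, inferred from the quoted distances
customer_1 = (2, 2)

d1 = dist(customer_1, c1)  # sqrt((2-1)^2 + (2-1)^2)
d2 = dist(customer_1, c2)  # sqrt((2-3)^2 + (2-4)^2)

# Membership goes to whichever centroid is closer
membership = "C1" if d1 <= d2 else "C2"

print(round(d1, 2), round(d2, 2), membership)  # 1.41 2.24 C1
```

Repeating the same two distance calculations for every customer fills in the rest of the membership column.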
The following is the scatter plot with cluster centroids C1 and C2 (displayed with blue and orange diamond shapes). The customers are marked with the color of the centroid they are closest to.
As we have randomly assigned the centroids, the second step is to move them iteratively. The new position of a centroid is calculated by taking the average of its member points. For the first centroid, customers 1, 2 and 3 are members. Hence, the new x-axis position for the centroid C1 is the average x-axis value for these customers, i.e. (2+1+1)/3 = 1.33. We will get the new coordinates for C1 equal to (1.33, 2.33) and C2 equal to (4.4, 4.2). The new plot is shown below:
Finally, one more iteration will move the centroids to the center of the clusters, as displayed below:
The positions of our black holes (cluster centroids) in this case turned out to be C1 (1.75, 2.25) and C2 (4.75, 4.75). The two clusters above are like two galaxies separated in space from each other.
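The whole procedure (assign each customer to the nearest centroid, move each centroid to the average of its members, repeat until memberships stop changing) can be sketched as below. Please treat this as a rough sketch and not the exact data or method behind the plots: the eight customer points are hypothetical values I picked so that they agree with the numbers quoted in this post, and the function names `assign`, `update` and `k_means` are my own.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical customers as (international, local) average call duration.
# Assumption: illustrative points chosen to match the centroids quoted in the text.
customers = [(2, 2), (1, 3), (1, 2), (3, 2), (4, 4), (4, 5), (5, 5), (6, 5)]

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)
    return moved

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until memberships stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels

# Start from the two initial centroids used in the text
final_centroids, final_labels = k_means(customers, [(1, 1), (3, 4)])
print([tuple(round(v, 2) for v in c) for c in final_centroids])
# [(1.75, 2.25), (4.75, 4.75)]
print(final_labels)
# [0, 0, 0, 0, 1, 1, 1, 1]
```

With these points, the first update gives (1.33, 2.33) and (4.4, 4.2), and the second gives the final centroids, after which no customer changes cluster and the loop stops.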
To me, the number of galaxies (~200 billion) and the number of stars (~4 × 10²²) put the human position in the universe in perspective. If humans act separately from the universe and nature, they are mathematically insignificant. However, when we are one with this great creation, the Sanskrit phrase Aham Brahmasmi (pronounced ah-HUM brah-MAHS-mee) sums it up. It means 'I am Brahma (the creator of the universe)' and 'I am the universe'. The creator and the creation are one and boundless.
See you soon with more on cluster analysis and the telecom case.