A Scientist & An Artist
A few weeks ago, while wandering around Florence, the birthplace of the Renaissance, I could not escape the thought of Leonardo da Vinci: the greatest polymath of all time. Leonardo’s illustrious resume contains titles such as painter, inventor, physicist, astronomer, engineer, biologist, anatomist, geologist, and architect – no kidding! A smart cat would have to live all her nine lives to acquire the nine titles Leonardo mastered in one lifetime. Today, while discussing facets of data visualization, we should pay homage to Uncle Leonardo as we cross the realms of both art and science.
Art and Science of Data Visualization
Data visualization, as mentioned earlier, is both art and science. I personally prefer to have a long look at the data, plotting it in various ways, before jumping into rigorous mathematical modeling. You might have noticed my penchant for art while going through the artwork presented in all the posts on this blog. The saying – a picture is worth a thousand words – holds true during data analysis as well. Models in analytics can go horribly wrong if you have not spent enough time on the exploratory phase – which, to me, is all about data visualization. Let me present a case study example to explain the aspects of data visualization during the exploratory phase.
Banking Case Study Example – Risk Management
Assume you are the chief risk officer (CRO) for CyndiCat bank, which disbursed 60816 auto loans in the quarter between April and June 2012. Today, about a year and a quarter after the loans were disbursed, you know that the loans have seasoned, i.e., bad loans can be tagged with greater certainty (read a detailed discussion). You have noticed a bad rate of around 2.5%, or 1524 bad loans out of a total of 60816 disbursed loans.
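As a quick sanity check, the overall bad rate can be recomputed from the two counts quoted above. This is only a minimal sketch; the variable names are my own and not part of any bank system.

```python
# Counts quoted in the case study above.
total_loans = 60816
bad_loans = 1524

# Good loans are simply the remainder.
good_loans = total_loans - bad_loans

# Overall bad rate: share of disbursed loans that turned bad.
bad_rate = bad_loans / total_loans

print(f"Good loans: {good_loans}")
print(f"Bad rate: {bad_rate:.2%}")
```

The result is a shade above 2.5%, which matches the rounded figure used in the rest of the post.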
Before you jump to multivariate analysis and credit scoring (read a detailed discussion on credit scoring), you want to analyze the bad rate across several individual variables. Based on your experience, you have a hunch that the borrower’s age at the time of loan disbursal is a key distinguishing factor for bad rates. Therefore, you have divided the loans based on the age of the borrowers and created a table something like the one below.
Using the above table, you have created a histogram and zoomed into the area of interest (close to the bad loans) as shown in the plots below.
You must have noticed the following:
• The distribution of loans across age groups is a reasonably smooth, roughly normal curve without too many outliers. Age often displays this kind of pattern for most products. However, do not expect similarly smooth curves for other variables commonly found in a business scenario. Often, you may have to resort to variable transformation to make the distributions smooth.
• The maximum number of bad loans is in the 42 to 45 years age bucket. This certainly does not mean the risk is also highest in this bucket; however, I once heard someone draw exactly that conclusion in a quarterly business review meeting, a silly mistake. Note that the maximum number of loans is also in the 42 to 45 years bucket. Absolute numbers do not provide enough information, hence we need to create a normalized plot.
• The data is really thin in the fringe buckets (i.e., the <21 and >60 years groups), with only 9 and 6 data points; be careful when dealing with such thin data. Sound business knowledge is extremely helpful for adjusting these fringe buckets during model development. For instance, you may know that loans to borrowers above 60 could be highly risky, but this data does not contain enough records to validate that hypothesis. We should assign an appropriate risk weight in such situations; however, be very careful while doing so.
The normalized plot is easy to construct. The idea is to scale each age group to 100% and overlay the percentages of bad and good records within it. We could extend the table shown above to get the values for the normalized plot as shown below.
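For readers who prefer code to spreadsheets, here is a minimal sketch of that normalization step. The counts below are made up purely for illustration and are not the figures from this post's table; the bucket labels and the `normalize` helper are likewise my own assumptions.

```python
# Hypothetical (total loans, bad loans) per age bucket.
# Illustrative numbers only, NOT the actual CyndiCat table.
buckets = {
    "<21": (9, 1),
    "30-33": (5000, 150),
    "42-45": (8000, 200),
    ">60": (6, 0),
}

def normalize(buckets):
    """Scale each age bucket to 100% and split it into bad % and good %."""
    rows = []
    for age_group, (total, bad) in buckets.items():
        good = total - bad
        rows.append({
            "age_group": age_group,
            "total": total,
            "bad": bad,
            "good": good,
            "bad_pct": 100.0 * bad / total,
            "good_pct": 100.0 * good / total,
        })
    return rows

normalized = normalize(buckets)
for row in normalized:
    print(f"{row['age_group']:>6}: bad {row['bad_pct']:.1f}%, good {row['good_pct']:.1f}%")
```

Each row now sums to 100%, so the bad and good percentages can be stacked on top of each other in a bar chart, one bar per age group. Notice that in these made-up numbers the bucket with the most bad loans (42-45) does not have the highest bad rate.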
Once you have the table ready, you can create a normalized plot quite easily, as shown below (again, we have zoomed into the plot to get a clear view of the bad rates).
These plots are completely different from the original frequency count plot and present the information in a completely different light. The following are the things one could conclude from the plots.
• There is a definite trend in the bad rates across age groups. As borrowers get older, they are less likely to default on their loans. That is a good insight.
• Again, the fringes (i.e., the <21 and >60 years groups) have thin data, but this information cannot be obtained from the normalized plot. Hence, you need to keep the frequency plot handy to treat thin data differently. A handy rule of thumb is to have at least 10 records of each case (good and bad) before taking the information seriously; with fewer records, the observed bad rate is not statistically reliable.
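The rule of thumb above is easy to turn into a small check that flags thin buckets before anyone reads too much into their bad rates. This is only a sketch: the `is_thin` helper, its threshold parameter, and the example counts are assumptions of mine, with the threshold of 10 taken from the rule of thumb.

```python
MIN_CASES = 10  # rule-of-thumb minimum for both good and bad records

def is_thin(total, bad, min_cases=MIN_CASES):
    """Return True if a bucket has fewer than min_cases good OR bad records."""
    good = total - bad
    return bad < min_cases or good < min_cases

# Hypothetical (total loans, bad loans) per bucket, for illustration only.
examples = {
    "<21": (9, 1),
    "42-45": (8000, 200),
    ">60": (6, 0),
}

for age_group, (total, bad) in examples.items():
    label = "thin, treat with care" if is_thin(total, bad) else "enough data"
    print(f"{age_group:>6}: {label}")
```

A bucket flagged as thin is a candidate for merging with a neighbouring bucket or for a judgment-based risk weight, as discussed earlier.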
I must conclude by saying that data visualization is the beginning of the modeling process and not the destination. However, it is a good and creative beginning.
With big data, data analysis tools and technologies, scientific progress, and a democratic environment, we could be living in the Renaissance of our times. However, we will need more Leonardo da Vincis to make these times really special.