Sherlock Holmes & Data Visualization
When I was a kid, a friend of mine owned a Sherlock Holmes toy kit – the envy of all our other friends. The kit had a Sherlock Holmes cap, a pipe, a watch and a magnifying glass. The magnifying glass was the most coveted item in the kit. The pleasure of focusing the magnifying glass on an object and seeing it in detail to derive meaning was my first lesson in detective investigation – something that I still relish as an analyst. This is also the core of data visualization. Later, I learned more about Mr. Holmes through the books written by Sir Arthur Conan Doyle. The first book, A Study in Scarlet, describes Mr. Holmes’ inclination for scientific knowledge and the science of deduction – analysis. I realized that being a detective is no different from being an experimental scientist or analyst. You start by gathering a set of observations, from which you build your case through logic and deduction. The following quote by Mr. Holmes perfectly describes the process of investigation: when you have eliminated the impossible, whatever remains, however improbable, must be the truth.
Data Visualization – A Case Study Example
In our last article, we started with a case study example about CyndiCat bank, which disbursed 60816 auto loans in the April–June 2012 quarter. You were playing the role of the Chief Risk Officer (CRO) for this bank. You had noticed a bad rate of around 2.5%, i.e. 1524 bad loans out of the 60816 disbursed. You started with a hunch about the relationship between the age of the borrowers and the bad rates. After your analysis, you observed a clear inverse relationship between the two: the older the borrowers, the lower the bad rate. Age of the borrowers certainly seemed like a strong contender for your credit risk model. You are feeling good and want to find a few more variables for your multivariate model. (Read the previous article)
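As a quick sanity check on the numbers above, here is a minimal sketch of the bad rate arithmetic. The function name bad_rate is my own choice for illustration; only the two loan counts come from the case study.

```python
def bad_rate(bad: int, total: int) -> float:
    """Share of disbursed loans that went bad, as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return bad / total


# Figures quoted in the case study.
disbursed_loans = 60816
bad_loans = 1524

portfolio_bad_rate = bad_rate(bad_loans, disbursed_loans)
print(f"Portfolio bad rate: {portfolio_bad_rate:.2%}")  # Portfolio bad rate: 2.51%
```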
The Case Study Example Continues…
You also believe that the income of the applicants should have some sort of relationship with the bad rates. You are feeling confident about the tools you used last time, i.e. the histogram and the normalized histogram (overlaid with good / bad borrowers). You immediately start by plotting an equal-interval histogram and observe the following:
Ouch! This is nothing like the smooth bell-curve histogram you observed for the age groups. Even the normalized histogram, shown below, is completely uninformative.
So, what is going on here? Income, unlike age, has a few extreme outliers – almost invisible in the histogram. There is a High-Net-Worth Individual (HNI) with a $1.47 million annual salary and a few other outliers in the middle. Incidentally, the loan to this HNI customer has gone bad – quite unfortunate for the bank. Have a look at the distribution table – almost 99.8% of the population is in the first two income buckets.
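To see why a single extreme value flattens an equal-interval histogram, here is a small sketch in plain Python. The incomes are made up for illustration (they are not the bank's data); only the $1.47 million outlier echoes the case study, and the helper name equal_width_counts is my own.

```python
def equal_width_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be identical")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value lands exactly on the upper edge; keep it in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up annual incomes in thousands of dollars: 998 "ordinary" borrowers
# between 20 and 150, plus two outliers (one at 1470, like the HNI above).
incomes = [20 + (i % 131) for i in range(998)] + [700, 1470]

counts = equal_width_counts(incomes, n_bins=10)
share_first_two = (counts[0] + counts[1]) / len(incomes)

print(counts)                    # [998, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(f"{share_first_two:.1%}")  # 99.8%
```

Because the bin width is set by the full range, one value at 1470 stretches every bin to 145 thousand dollars, and all the ordinary borrowers collapse into the first bin.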
Here, as an analyst, you need to decide whether to include these extreme cases, with thin data, in your model, or to set an income boundary within which the model applies to the majority of the customers. In my opinion, the latter option is the prudent choice. Continuing with your exploratory analysis and data visualization, you decide to zoom into the region with most of the data points, i.e. the first two buckets, and re-plot the histogram. The following is what you observe:
This time, the histogram is reasonably smooth and hence does not require any transformation. The corresponding normalized histogram is presented below.
The following conclusions can be drawn from the above:
• There is a definite trend between the bad rates and the income groups. Borrowers who earn a higher salary are less likely to default on their loans. This seems like a good insight.
• For the last bucket, i.e. >150 K, the risk jumps up – a break in the trend. This can be attributed to the thin data in this bucket: not only is the count low, but the data is also spread across a very large interval, 150 K to 1500 K.
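The normalized histogram behind these conclusions boils down to a bad rate per income bucket. Here is a minimal sketch of that calculation; the ten borrowers, the bucket edges and the function name bad_rate_by_bucket are all invented for illustration and are not the bank's data.

```python
from bisect import bisect_right


def bad_rate_by_bucket(incomes, is_bad, edges):
    """Bad rate per income bucket.

    edges are ascending boundaries; bucket 0 holds incomes below edges[0],
    bucket i holds incomes in [edges[i-1], edges[i]), and the last bucket is
    open-ended. Returns None for a bucket with no borrowers.
    """
    totals = [0] * (len(edges) + 1)
    bads = [0] * (len(edges) + 1)
    for income, bad in zip(incomes, is_bad):
        idx = bisect_right(edges, income)
        totals[idx] += 1
        bads[idx] += int(bad)
    return [b / t if t else None for b, t in zip(bads, totals)]


# Toy data: annual income in thousands of dollars and a 1/0 bad-loan flag.
incomes = [30, 40, 45, 60, 80, 90, 95, 120, 130, 900]
is_bad = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]
edges = [50, 100, 150]

rates = bad_rate_by_bucket(incomes, is_bad, edges)
for label, rate in zip(["<50 K", "50-100 K", "100-150 K", ">=150 K"], rates):
    print(label, "no data" if rate is None else f"{rate:.0%}")
```

In this toy set the last bucket contains a single borrower, so one bad loan swings its bad rate to 100%. That is the thin-data effect described in the second bullet, in miniature.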
Now you have two variables that possibly govern the bad rates of the borrowers – age and income. However, your further analysis of income against age shows a high correlation between the two variables – 0.76 to be precise. Using both in the model would be problematic because of multicollinearity. The correlation between age and income makes sense: income is a function of a professional’s years of experience, which in turn depends on her age. Hence, you have decided to drop income from the model. This leaves us with a question: is there a way of bringing income back into our multivariate model?
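For readers who want to run this kind of check themselves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. I am assuming Pearson is the correlation meant here, since the article does not say which measure was used. The six age and income pairs are invented, so the result will not match the 0.76 from the case study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented borrowers: age in years, annual income in thousands of dollars.
ages = [25, 30, 35, 40, 45, 50]
incomes = [30, 42, 41, 60, 58, 75]

r = pearson(ages, incomes)
print(f"Correlation between age and income: {r:.2f}")
```

A value close to 1 says the two variables carry largely the same information, which is why keeping both in one model invites multicollinearity.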
When corporate analysts analyze the financials of a company, they often work with several financial ratios. Working with ratios has a definite advantage over working with plain vanilla variables: combined variables often convey much more information. Seasoned analysts understand this really well. Moreover, variable creation is a creative exercise that requires sound domain knowledge. For credit analysis, the ratio of the sum of obligations to income is highly informative, since it provides insight into the share of income that remains disposable for the borrower.
Let us try to understand this with an example. Susan has an annual income of 100 thousand dollars. She has a home loan with an annual obligation (EMI) of 40 thousand dollars and a car loan with an annual obligation of 10 thousand dollars. Hence, she spends 40 + 10 = 50 thousand dollars of her 100 thousand dollar income on EMIs. Her Fixed Obligation to Income Ratio (FOIR) is therefore 50/100 = 50%. She is left with just 50% of her income for her other expenses.
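Susan's example translates directly into a few lines of code. This is only a sketch: the function name foir and its signature are my own, while the numbers are the ones from the example above.

```python
def foir(annual_obligations, annual_income):
    """Fixed Obligation to Income Ratio: total fixed obligations / income."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    return sum(annual_obligations) / annual_income


# Susan: home loan EMI of 40 thousand and car loan EMI of 10 thousand dollars
# a year, against an annual income of 100 thousand dollars.
susan_foir = foir([40_000, 10_000], 100_000)

print(f"FOIR: {susan_foir:.0%}")                        # FOIR: 50%
print(f"Left for other expenses: {1 - susan_foir:.0%}")  # Left for other expenses: 50%
```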
The following is the normalized histogram plot for FOIR.
Clearly, the bad rate rises with FOIR. Additionally, FOIR has little correlation with age, just 0.18. Now you have another variable, FOIR, along with age, for your multivariate model. Congratulations! Like Sherlock Holmes, you are building your case evidence by evidence – a process at the heart of science.
I hope after this you are feeling inspired to pick up the magnifying glass and follow the legacy of the great Sherlock Holmes – this time the mystery is hiding in data!