Sherlock Holmes & Data Visualization
When I was a kid, a friend of mine owned a Sherlock Holmes toy kit – the envy of all our other friends. The kit had a Sherlock Holmes cap, a pipe, a watch and a magnifying glass. The magnifying glass was the most coveted item in the kit. The pleasure of focusing the magnifying glass on an object and seeing it in detail to derive meaning was my first lesson in detective investigation – something that I still relish as an analyst. This is also the core of data visualization. Later, I learned more about Mr. Holmes through the books written by Sir Arthur Conan Doyle. The first book, A Study in Scarlet, describes Mr. Holmes’ inclination for scientific knowledge and the science of deduction – analysis. I realized that being a detective is no different from being an experimental scientist or analyst. You start by gathering a set of observations, from which you build your case through logic and deduction. The following quote by Mr. Holmes perfectly describes the process of investigation: “when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
Data Visualization – A Case Study Example
In our last article, we started a case study about CyndiCat bank, which disbursed 60816 auto loans in the quarter April–June 2012. You were playing the role of the Chief Risk Officer (CRO) for this bank. You had noticed a bad rate of around 2.5%, i.e. 1524 bad loans out of the 60816 loans disbursed. You started with a hunch about the relationship between the age of the borrowers and the bad rates. After your analysis, you observed a definite inverse relationship between the two. Age of the borrowers certainly seemed like a strong contender for your credit risk model. You are feeling good and want to find a few more variables for your multivariate model. (Read the previous article)
The Case Study Example Continues…
You also believe that the income of the applicants should have some relationship with the bad rates. You are feeling confident about the tools you used last time, i.e. the histogram and the normalized histogram (overlaid with good / bad borrowers). You immediately start by plotting an equal-interval histogram and observe the following:
Ouch! This is nothing like the smooth bell-curve histogram you observed for the age groups. Even the normalized histogram, shown below, is completely uninformative.
So, what is going on here? Income, unlike age, has a few extreme outliers – almost invisible in the histogram. There is a High-Net-Worth Individual (HNI) with a $1.47 million annual salary and a few other outliers in the middle. Incidentally, the loan to this HNI customer has gone bad – quite unfortunate for the bank. Have a look at the distribution table – almost 99.8% of the population is in the first two income buckets.
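A minimal sketch of why one extreme value flattens an equal-interval histogram. The income figures (in thousands) are made up for illustration and are not the bank's data; `equal_interval_counts` is a hypothetical helper, not a library function.

```python
def equal_interval_counts(values, n_bins):
    """Count values in n_bins equal-width buckets spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bucket.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical annual incomes in thousands, with one HNI-style outlier.
incomes_k = [20, 35, 40, 55, 60, 75, 90, 110, 130, 1470]

# With the outlier, each bucket is 290 K wide and nearly everyone shares one.
full_counts = equal_interval_counts(incomes_k, 5)

# Restricting to an assumed cap of 150 K spreads the same people out.
capped = [v for v in incomes_k if v <= 150]
capped_counts = equal_interval_counts(capped, 5)

print(full_counts)
print(capped_counts)
```

The bucket width is driven entirely by the largest value, so a single outlier is enough to push the bulk of the borrowers into one bar.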
Here, as an analyst, you need to decide whether to include these extreme cases, with thin data, in your model or to set an income boundary within which the model applies to the majority of the customers. In my opinion, the latter option is the prudent choice. Going further with your exploratory analysis and data visualization, you decide to zoom into the region with most of the data points, i.e. the first two buckets, and re-plot the histogram. The following is what you observe:
This time, the histogram is reasonably smooth and hence does not require transformation. Presented below is the normalized histogram for the above histogram.
The following conclusions can be drawn from the above:
• There is a definite trend in the bad rates across the income groups. Borrowers who earn a higher salary are less likely to default on their loans. This seems like a good insight.
• For the last bucket, i.e. >150 K, the risk jumps up – a break in the trend. This can be attributed to the thin data in this bucket: not only is the count low, but the data is also spread across a very large interval, 150 K to 1500 K.
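The normalized histogram behind these conclusions boils down to a bad rate per income bucket. Here is a rough sketch of that calculation; the records and the bucket edges are invented for illustration, and `bad_rate_by_bucket` is a hypothetical helper.

```python
from bisect import bisect_right


def bad_rate_by_bucket(records, edges):
    """Return the share of bad loans in each bucket.

    records: iterable of (income, is_bad) pairs.
    edges:   sorted bucket boundaries; a value equal to an edge
             falls in the bucket to the right of that edge.
    """
    n_buckets = len(edges) + 1
    totals = [0] * n_buckets
    bads = [0] * n_buckets
    for income, is_bad in records:
        i = bisect_right(edges, income)
        totals[i] += 1
        bads[i] += int(is_bad)
    # None marks an empty bucket, where a rate is undefined.
    return [b / t if t else None for b, t in zip(bads, totals)]


# Hypothetical (income in thousands, loan went bad?) records.
records = [
    (20, True), (30, False), (40, True), (45, False),
    (55, True), (60, False), (80, False), (95, False),
    (100, False), (110, True), (120, False), (130, False), (140, False),
    (400, False), (1470, True),
]
edges = [50, 100, 150]  # buckets: <50, 50-100, 100-150, >=150

rates = bad_rate_by_bucket(records, edges)
print(rates)
```

In this toy data the last bucket holds only two loans, so one bad loan is enough to make its rate jump, which is the thin-data effect described in the second bullet.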
Now you have two variables that possibly govern bad rates for the borrowers – age and income. However, your further analysis of income against age shows that there is a high correlation between the two variables – 0.76 to be precise. Using both in the model would be problematic because of multicollinearity. The correlation between age and income makes sense: income is a function of a professional’s years of experience, which in turn depend on her age. Hence, you have decided to drop income from the model. This leaves us with a question: is there a way of bringing income back into our multivariate model?
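For readers who want to reproduce this kind of check, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The age and income values are made up for illustration; they are not the case-study data and will not give 0.76.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical borrowers: age in years, income in thousands.
ages = [25, 30, 35, 40, 45, 50]
incomes_k = [30, 42, 38, 60, 55, 80]

r_age_income = pearson_r(ages, incomes_k)
print(round(r_age_income, 2))
```

Note that this coefficient only measures linear association, so a strongly curved relationship between two variables can still produce a modest value.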
Financial Ratios
When corporate analysts analyze the financials of a company, they often work with several financial ratios. Working with ratios has a definite advantage over working with plain vanilla variables: combined variables often convey much more information. Seasoned analysts understand this really well. Moreover, variable creation is a creative exercise that requires sound domain knowledge. For credit analysis, the ratio of the sum of obligations to income is highly informative, since it provides an insight into the percentage of disposable income left with the borrower.
Let us try to understand this with an example. Susan has an annual income of 100 thousand dollars. She has a home loan with an annual obligation (EMI) of 40 thousand dollars and a car loan with an annual obligation of 10 thousand dollars. Hence, she is spending 40 + 10 = 50 thousand dollars on EMIs out of her income of 100 thousand dollars. Her Fixed Obligation to Income Ratio (FOIR) in this case is 50/100 = 50%. She is left with just 50% of her income to cover her other expenses.
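Susan's calculation can be written as a small function. This is only a sketch of the ratio as described above; the function name `foir` and its argument names are my own.

```python
def foir(annual_obligations, annual_income):
    """Fixed Obligation to Income Ratio: total obligations / income."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    return sum(annual_obligations) / annual_income


# Susan: home loan EMI of 40K and car loan EMI of 10K on an income of 100K.
susan_foir = foir([40_000, 10_000], 100_000)
print(f"{susan_foir:.0%}")  # 50%
```

Because the ratio is scaled by income, two borrowers with very different salaries can be compared on the same footing.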
The following is the normalized histogram plot for FOIR.
Clearly, there is a direct relationship between FOIR and bad rate. Additionally, FOIR has little correlation with age, just 0.18. Now you have another variable, FOIR, along with age, for your multivariate model. Congratulations! Like Sherlock Holmes, you are building your case evidence by evidence – a process in science.
Sign-off Note
I hope after this you are feeling inspired to pick up the magnifying glass and follow the legacy of the great Sherlock Holmes – this time the mystery is hiding in data!
I agree that our roles as data scientists are like detective work. I found both posts easy to read, and they made analytics more exciting.
As a fellow detective though, I am wondering when a rho of 0.76 between two candidate explanatory variables became a barrier to using both of them together in a predictive model. Many would argue that multicollinearity per se (unless severe) may safely be ignored if the goal is prediction. The model that predicts the best wins according to this view. I tend to use VIFs (Variance Inflation Factors) and condition indices to diagnose when multicollinearity is a problem in an OLS model & use a diagnostic OLS model to measure the same even when using a logit or other non-linear model.
Often, the relationship between age and income is actually non-linear, as suggested by the economist Modigliani over 60 years ago. When we are young, spending (on average) exceeds income, as income is lower than it is at “middle” age, when earning power is at its peak and income exceeds spending, allowing saving. As we age into retirement, we dis-save: spending is again higher than income, but rather than borrowing against future earnings as we did while young, we fund the excess spending from savings. At both ends of life, income and consumption are lower than in the middle, as illustrated below…
The example becomes a bit confusing to me without deciling income into equal size frequency bands. It also appears that the x-axis is mislabeled with income bands when the axis label says Age Groups…? Also if we have but one loan between 1400-1500k, how can it be 5% of loans in the second graph?
Thanks Dean for making some good points in your comment. Firstly, I have marked x axis wrongly in one of the histograms where it should be income groups. Thanks for reading the article so carefully and suggesting a correction, this will help the other readers.
Secondly, this article is adapted from one of the consulting projects I did a few years ago. Of course, the numbers are completely fuzzed up and the trends are modified to make a stronger case in the article. But I guess the 60-year-old theory by Modigliani does not hold true in modern urban India for a specific socio-economic stratum. I am also not surprised, because theories in economics are rarely universal. Additionally, FOIR is only a rough proxy for consumption because it only takes into account liabilities from loans (I must also point out that the senior generation in India is loan averse). However, I would love to know from your experience if you have seen some live applications of this theory.
Additionally, I have noticed on numerous occasions that adding a highly correlated variable rarely provides significant predictive power to the model, while it does cost an additional variable (Occam’s razor). One way to handle multicollinearity is of course to create composite variables through Principal Component Analysis (PCA). However, if possible I always prefer to play around and create some interesting ratio variables using one’s own hypothesis to tackle multicollinearity – it makes you feel clever for a few days, a little harmless vanity we can all live with 🙂
Roopam thank you for the response. I am glad you pointed out the part about modern India because this is just why I am starting to follow this blog so as to gain positive synergy from a wider world view. (Also, some of my examples are met with blank stares by my S. Asian & Chinese colleagues!)
Yes, econ is a social science so that theory and actuality are hardly automatically (arguably sometimes) the same as, say, in astronomy. Plus, I have taken it a bit out of context. It took a mere second to find this example from NZ:
Professionally, I would have to strain my mind to think of many aggregate examples where this relationship is statistically significant. I think the US rule-of-thumb is that retirees should expect to live on an income of 60% of their “working age” income. We probably don’t observe bad rates for those below a certain income threshold, for example. A younger person may “need” a fancy car and have to borrow, while an older person could pay cash but is perhaps content with memories of their cars from long ago.
What really matters is the target market’s profile so that high tech goods and services might behave quite differently.
I am fascinated by your point about loan aversion among older folks as the same is true here, but I gather not as strong as in other parts of the world—Chinese colleagues tell me that their parents (who are not too old) often are shocked to find out that their transplanted-to-US offspring have home mortgages. We often see regional and religious differences as well so that US Midwesterners (N. Dakota/Minnesota) are slightly less likely to carry credit card balances than someone from Nevada or Florida for example.
I am curious about what rule-of-thumb you use to diagnose multicollinearity, apparently below .76. What is “high” correlation? PCA variables are hard to interpret for the senior exec or marketing set and I have little experience there.
Ratio variables are useful—this is the real detective work.
In most Indian banks / financial companies, for salaried customers, you will rarely find borrowers above 60 (i.e. retired professionals). This could be attributed both to lending policies and to the aversion of retired professionals to applying for loans. Hence, the last two buckets in the above data set from NZ could easily be removed in the Indian banking scenario (though the above trend will hold true if we look at the Indian population as a whole, which includes people above 60). Banking data is certainly not a complete representation of the country’s population (it is a bit skewed because of lending policies). This could be the reason for the high correlation between age and income.
About multicollinearity: in theory, one would consider a VIF above 5 or a bivariate correlation above 0.9 a definite red flag for multicollinearity. However, how often in a data mining problem would one come across a correlation so high? These rules, I think, are for traditional statistical problems with 30 to 100 data points. Hence, one has to create one’s own rules of thumb. I often look at both the correlation and the additional predictive power the second variable offers the model (I know this is not the standard textbook definition of multicollinearity). It could be that just by including one variable you would achieve the same predictive power as by including both variables. I am generally a huge fan of Occam’s Razor.
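To connect the two rules of thumb above: when a model has exactly two predictors, the VIF of each is 1 / (1 − r²), where r is the correlation between them. The sketch below assumes that two-predictor case only; with more predictors the VIF has to come from regressing each predictor on all the others.

```python
def vif_two_predictors(r):
    """VIF of either predictor in a model with exactly two predictors.

    r is the Pearson correlation between the two predictors. With only
    two predictors, the R-squared of regressing one on the other is r**2.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


vif_age_income = vif_two_predictors(0.76)  # age vs income in the article
vif_age_foir = vif_two_predictors(0.18)    # age vs FOIR in the article
vif_red_flag = vif_two_predictors(0.90)    # the 0.9 correlation threshold

print(round(vif_age_income, 2), round(vif_age_foir, 2), round(vif_red_flag, 2))
```

Under this assumption, a correlation of 0.76 gives a VIF of about 2.4, below the threshold of 5, while a correlation of 0.9 gives about 5.3, so for two predictors the two rules of thumb roughly coincide.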
Analysts like detectives bring in their own personality in investigation! Dean, I am enjoying our conversation, thanks for raising some really good points.
I, too, am enjoying our conversation. Do feel free to call me Gordy as all my colleagues/friends do.
Quite an interesting point about lending policies for the over-60 set. Is there any “age discrimination” legislation on denial of loans to older borrowers due to business rules? In the US, there is such legislation, but the Fed has found that for many models de facto age discrimination occurs due to cross-correlated variables. We also see an increase in the labor force participation rate, so that a much larger share of the over-65s are actually working or are active investors. Remember too that I admitted (or meant to) that the NZ data and Modigliani apply most strongly to the population as a whole, so that in business practice we are not observing low-income folks. Here (possibly unlike India) many elders are alone and quite poor, which is not really pertinent to our detective case. Further, if we are looking at the super-affluent population only, my opinion is that Modigliani does not hold well.
Since I am often working with transactions data with class variables 0.9 is not infrequently seen even with observations in the hundreds of thousands +. Although a frequentist, I am often “ordered” not to sample & here is where I am most interested in your thinking as data mining has a new connotation with today’s “big data” than it does/did when prescriptive econometricians ruled my part of the continuum.
Agree about Occam’s razor as we were told not to “data mine” (under an older definition of the term) that really meant both over-sampling and a lack of parsimony in modeling.
You are right that the extra “lift” or prediction is what it is all about, but to me using age & income together in a model (they are often non-linearly related) is not the same as using 200 variables to over-sample for a model that could have performed nearly as well with 25. Plus, who says we have to use stat models when a neural net or one of these new topological constructions can do a better job!
Detectives who are too uni-dimensional in our N-space data world don’t obtain positive synergy and can’t solve cases as quickly or as well in many cases.
Unlike the US, we don’t have very strong laws against discrimination by lenders in India. In fact, lenders don’t even need to justify their reasons for rejection to the borrower. Lending is a highly agent-driven exercise in India. These sales agents usually know which lender is most likely to approve the borrower’s profile. As I mentioned, for the longest time loans were taboo in India, though things are changing very fast (I am not sure if it is all for the better).
Gordy, since you mentioned non-linearity: I am a huge fan of non-linearity. Incidentally, chaos theory, fractal geometry and non-linear dynamics are among my favorite topics. The world is a beautiful place because of non-linearity. I am getting a decent hang of neural nets and SVMs because of the last couple of projects I did, in addition to self-learning. Though I still think things are much easier to explain in the linear world of regression. Could you recommend some books / videos / tutorials for a more intuitive understanding of the new topological constructs? It is always good for detectives to be fully aware of the new techniques for investigation. DNA sampling, for that matter, complemented the power of the magnifying glass well.
A very good read, thanks. True, but in my personal experience most banking authorities refuse loans to senior citizens because they are on the edge, not because we seniors don’t ask for them.
@Sujata: Banking in India is not as transparent as one would like it to be. Typically banks only give loans to seniors as co-applicants with their children. That’s a shame.
Hi Roopam,
I just loved this article. The language is so pristine and unambiguous. This article inspired me to update my knowledge of data analysis. Although my work is different from that of a data scientist, I can always use data analysis. I hope you will write more articles like this.
Hi Roopam, if possible, can we expect a blog post on financial inclusion in developing countries? It would be great if it were exclusively on India.