A Scientist & An Artist
A few weeks ago, while wandering around Florence, the birthplace of the Renaissance, I could not escape the thought of Leonardo da Vinci: the greatest polymath of all time. Leonardo’s illustrious resume contains titles such as painter, inventor, physicist, astronomer, engineer, biologist, anatomist, geologist, and architect. No kidding! A smart cat would have to live all her nine lives to acquire the nine titles Leonardo mastered in one lifetime. Today, while discussing facets of data visualization, we should pay homage to Uncle Leonardo as we cross the realms of both art and science.
Art and Science of Data Visualization
Data visualization, as mentioned earlier, is both art and science. I personally prefer to take a long look at the data, plotting it in various ways, before jumping into rigorous mathematical modeling. You might have noticed my penchant for art in the artwork presented in the posts on this blog. The saying that a picture is worth a thousand words holds true during data analysis as well. Models in analytics can go horribly wrong if you have not spent enough time on the exploratory phase, which to me is all about data visualization. Let me present a case study example to explain the aspects of data visualization during the exploratory phase.
Banking Case Study Example – Risk Management
Assume you are the chief risk officer (CRO) for CyndiCat bank, which disbursed 60816 auto loans in the quarter April–June 2012. Today, about a year and a quarter after disbursal, the loans have seasoned, so bad loans can be tagged with greater certainty (read a detailed discussion). You have noticed a bad rate of around 2.5%, or 1524 bad loans out of the 60816 disbursed.
Before you jump to multivariate analysis and credit scoring (read a detailed discussion on credit scoring), you want to analyze the bad rate across several individual variables. Based on your experience, you have a hunch that the borrower’s age at the time of loan disbursal is a key distinguishing factor for bad rates. Therefore, you have divided the loans by the age of the borrowers and created a table like the one below.
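As a quick check on the arithmetic, the bad rate can be reproduced from the two counts quoted in the text. This is only a minimal sketch; the variable names are chosen for illustration.

```python
# Counts quoted in the case study
total_loans = 60816
bad_loans = 1524

# Bad rate = bad loans as a share of all disbursed loans
bad_rate = bad_loans / total_loans
good_loans = total_loans - bad_loans

print(f"Bad rate: {bad_rate:.2%} ({bad_loans} bad, {good_loans} good)")
```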
Using the above table, you have created a histogram and zoomed into the area of interest (close to the bad loans) as shown in the plots below.
You must have noticed the following:
• The distribution of loans across age groups is a reasonably smooth, roughly normal curve without too many outliers. Age often displays this kind of pattern for most products. However, do not expect similarly smooth curves for other variables commonly found in a business scenario. Often, you may have to resort to variable transformation to make the distributions smooth.
• The maximum number of bad loans is in the age bucket 42 to 45 years. This certainly does not mean the risk is also the highest in this bucket. I once heard someone draw exactly that conclusion in a quarterly business review meeting, which is a silly mistake. Note that the maximum number of loans is also in the 42 to 45 years bucket. Absolute numbers do not provide enough information; hence we need to create a normalized plot.
• The data is really thin in the fringe buckets (i.e. the <21 and >60 years groups), with only 9 and 6 data points. Be careful when dealing with such thin data. Sound business knowledge is extremely helpful for modifying these fringe buckets during model development. For instance, you may know that loans to borrowers above 60 could be highly risky, but this data does not contain enough records to validate that hypothesis. We should supply an appropriate risk weight in such a situation; however, be very careful while doing so.
Normalized Plot
The normalized plot is easy to construct. The idea is to scale each age group to 100% and overlay the percentages of good and bad records within it. We could extend the table shown above to get the values for the normalized plot, as shown below.
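The scaling step can be sketched in a few lines of plain Python. Note that the counts below are hypothetical and made up purely for illustration; they are not the figures from the case study table. Only the fringe totals of 9 and 6 records echo the text, and the split between good and bad within them is assumed. The function name `normalize` is likewise my own choice.

```python
# Hypothetical counts for illustration only; NOT the case study's actual table.
loans_by_age = {
    "<21":   {"good": 8,    "bad": 1},
    "21-24": {"good": 380,  "bad": 20},
    "42-45": {"good": 9800, "bad": 200},
    ">60":   {"good": 6,    "bad": 0},
}

def normalize(counts):
    """Scale each age group to 100% and return the good and bad percentages."""
    shares = {}
    for group, c in counts.items():
        total = c["good"] + c["bad"]
        # Guard against an empty group to avoid division by zero
        bad_pct = 100.0 * c["bad"] / total if total else 0.0
        shares[group] = {"good_pct": 100.0 - bad_pct, "bad_pct": bad_pct}
    return shares

normalized = normalize(loans_by_age)
for group, pct in normalized.items():
    print(f"{group:>6}: bad {pct['bad_pct']:.1f}%  good {pct['good_pct']:.1f}%")
```

With these made-up numbers, the 42-45 group holds the most bad loans in absolute terms, yet its bad percentage is lower than that of the 21-24 group, which is exactly why the absolute counts alone can mislead.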
Now, once you have the table ready you could create a normalized plot quite easily, as shown below (again we have zoomed into the plot to get a clear view of bad rates).
These plots are completely different from the original frequency count plot and present the information in a completely different light. The following are the things one could conclude from the plots.
• There is a definite trend in the bad rates across the age groups. As borrowers get older, they are less likely to default on their loans. That is a good insight.
• Again, the fringes (i.e. the <21 and >60 years groups) have thin data, but this information cannot be obtained from the normalized plot. Hence, you need to have the frequency plot handy to treat thin data differently. A handy rule of thumb is to have at least 10 records of both good and bad cases before taking the information seriously; with fewer, the observed bad rate is not statistically reliable.
I must conclude by saying that data visualization is the beginning of the modeling process and not the destination. However, it is a good and creative beginning.
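The rule of thumb above is easy to apply mechanically alongside the normalized values. The sketch below again uses hypothetical counts (not the case study's table), and `thin_groups` is an illustrative helper name rather than anything from a library.

```python
# Hypothetical counts for illustration only; NOT the case study's actual table.
loans_by_age = {
    "<21":   {"good": 8,    "bad": 1},
    "21-24": {"good": 380,  "bad": 20},
    "42-45": {"good": 9800, "bad": 200},
    ">60":   {"good": 6,    "bad": 0},
}

MIN_RECORDS = 10  # rule-of-thumb threshold from the text

def thin_groups(counts, min_records=MIN_RECORDS):
    """Return groups with fewer than min_records good OR bad cases."""
    return [
        group
        for group, c in counts.items()
        if c["good"] < min_records or c["bad"] < min_records
    ]

flagged = thin_groups(loans_by_age)
print("Treat with caution:", flagged)
```

Groups flagged this way are the ones where the frequency plot, together with business judgment, should override whatever the normalized plot seems to show.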
Sign-off Note
With big data, data analysis tools and technologies, scientific progress, and a democratic environment, we could be living in the Renaissance of our times. However, we will need more Leonardo da Vincis to make these times really special.
Hi Roopam,
Thanks a lot buddy for giving a good insight on Analytics. Your blogs are really very informative and made me understand the nitty gritty of Analytics. Look forward to your future posts.. God bless you!!!
Thanks,
Amit Chandra
Thanks
Hi Roopam,
Yours is easily the most effective guide on predictive analytics I have come across. You made statistics sound as cool as science and engineering.
regards,
Abhishek
Thanks Abhishek, appreciate your kind words.
Hi Roopam,
I enjoy going through each of your articles. Incredible work!!
Can you help me understand how you decide on the different age bands?
Thanks
Thanks Sourav,
Here the age bands were formed by eyeballing. The idea is to look for a significant change in the gradient of average risk from one band to the next. You could also use univariate decision trees (CHAID) to create bands.
Profile analytics is fundamental to enriching the business! Data mining combined with predictive models is the value added to the business!
Your posts are very good! Thanks for sharing!
Here in your graph there is a clear downward trend. What if the data doesn’t have any (increasing or decreasing) trend? In that case WOE values cannot be monotonically increasing or decreasing. What should be done in those cases?
A monotonically decreasing or increasing trend is not the primary requirement for the development of models. For instance, age forms a U-shaped plot against bad rate for banks in developed economies. This is logical, since the repayment capability of the elderly is not as strong as that of middle-aged working professionals. The data used in this case study is for a developing economy where borrowers’ age is fictitiously capped at 60. I hope you have noticed the thin data for ages above 57 years. Hence, the important condition is logical consistency rather than the trend line. In many cases, a regular trend line supports logical consistency better than a randomly fluctuating trend.
Hi Roopam,
Thanks a lot for the good insights on analytics.
Sir, I didn’t get the meaning of the thin data you referred to in the article. Can you make it clear for me?
Thin data refers to data with a low or insufficient sample size. This also applies to a small sample size for either the good or the bad class of loans. You will find this article useful: identify the right sample size for your analysis
Thanks Roopam – Your blogs are quite interesting, informative and intuitive.
Is there a sample data set for this example that I can use for practice?
Hi Roopam,
Here you have considered Age as the variable for binning and categorizing the data. Let’s say there is another variable, Income, along with Age. This is also a variable that can determine the good and bad rates. How do you now decide which is the distinguishing factor for bad rates?
Hi Roopam,
It is a good way to learn analytics. Excellent work!
Hi Roopam,
Hope you are doing well.
It was one of my best experiences working with you. You are a valuable asset to any Organization and a best guide 🙂
Good luck.
Deepak
Thanks, Deepak! Really appreciate your comment. Be well.
Dear sir,
Can you share the Jupyter notebook/code you used in this article?
Dear Roopam,
Let’s say I replace the original variables with WOE and then check for multicollinearity. Is this the right procedure?
Yes, that’s fine.
Hi Roopam,
The content is awesome. But can we have the data that is used here?
Much Appreciated,
Ankush
Hi Roopam,
Thanks for the good work.
How can I decrease the number of age groups? For example, I want to see coefficients for 5 groups.