Modeling in Advanced Analytics
The room, full of analysts, erupts in laughter as a young business analyst narrates an incident from his recent trip back home. A distant aunt inquired about his new profession. His response – I am into modeling. She got all excited and asked – is it just on the ramp or will I see you on the television? Jokes apart, this left me wondering about the roots of the word modeling or model. What is a model?
A model is defined as a simplified representation of reality. A representation of reality, hmmm. A photograph is a representation of reality – a moment of reality captured on the reel – so does that make it a model? I think yes. Similarly, a newspaper reporter's account of an incident that becomes breaking news is also a model – a descriptive model. Now, let us try to link models with analytics.
Data warehouse, Business Intelligence and Advanced Analytics
Analytics has received a massive boost because of the emergence of information technology. We are living in the era of big data. A plethora of data collected at every stage of the business process has created a need to extract knowledge out of the information. This overall process has three aspects to it:
1. Data warehouse or data marts: transactional data is extracted, transformed and loaded (ETL) into a data model / schema for the purpose of analysis
2. Business Intelligence or dashboards: “as is” business reports
3. Predictive Analytics or Advanced Analytics: high-end statistical and data mining exercise
As the quantum of data is increasing exponentially, Hadoop and big data technologies are replacing data warehouses. However, the thought process for business intelligence and predictive analytics – the focus of this article – will not change much. Let me try to distinguish between business intelligence and predictive analytics using something I learned at a professional theater.
5Ws for business intelligence & predictive Analytics – Lessons from Theater
I joined a professional theater group a few years ago. To understand the nuances of acting we started with improv or improvisation theater. This form of theater does not have a predefined script; the actors build the story while performing. Most people thought I was a good improv actor. However, the style of remembering dialogue while performing did not work very well for me and hence it was the end of my theater gig. Still, I learned some good lessons from the whole experience. One of them was the five Ws of deciphering a character to build the drama.
1. What happened?
2. When did it happen?
3. Where did it happen?
4. Who was part of this?
5. Why did it happen?
Clearly, the first four questions are trying to report an as-is version of reality – a descriptive model. This is exactly what business intelligence professionals try to achieve through fancy reporting platforms & software. The fifth question is the trickiest of the lot. It is the question that keeps scientists and inquisitive minds awake late at night.
Newton’s Legacy
An apple falls from a tree. How difficult is it to answer the first four questions? Most of us can answer them with the help of a clock and a map. Isaac Newton, however, answered the fifth question, and his answer was gravity. Had he stopped there, nobody would remember him close to four hundred years after his birth. He gave a mathematical model to explain this phenomenon.
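In conventional textbook notation (the symbols are the standard ones, not necessarily those the author used), Newton's law of universal gravitation for the apple and the earth can be written as below. Here F is the attractive force, G is the gravitational constant, the two m terms are the masses of the apple and the earth, and r is the distance between their centers.

```latex
F = G \, \frac{m_{\text{apple}} \, m_{\text{earth}}}{r^{2}}
```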
Replace apple and earth with any other objects and you have the general equation for the model. Albert Einstein did shatter the Newtonian notion of gravity. However, this model still holds good for most practical purposes and is used extensively in rocket science.
Advanced analytics tries to help answer the fifth question, why something happened, using predictive modeling. The combination of high-end statistical and data mining techniques with analysts’ business acumen produces models that help organizations make informed decisions. Remember, this is just the beginning and causality is still a fair distance away!
Credit Scoring Models
Credit scorecards are models that predict the probability of a borrower defaulting on his/her loan. The following is a simplified version of a credit score with three variables:
Credit Score = Age + Loan to Value Ratio (LTV) + Installment (EMI) to Income Ratio (IIR)
A 28-year-old man with an LTV of 75 and an IIR of 60 will have a score of 10 + 50 + 5 = 65 and hence is a high credit risk.
Now the question is, how did we arrive at the bucket-wise score points and associated risk tables? By now, after going through the previous three articles of the series, you must have some idea how we will go about it. We have a historical list of good / bad borrowers (article 2) that we want to distinguish using predictor variables (article 3). There are several statistical & data mining techniques that could help us achieve our objective, such as
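The additive lookup behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The only point values taken from this article and its discussion are 10 for an age below 32, 50 for an LTV of 75, 5 for an IIR of 60, and 75 for an IIR between 20 and 50. Every other bucket edge and point value, and the names SCORECARD, points and credit_score, are hypothetical placeholders.

```python
from bisect import bisect_right

# Each variable maps to (bucket edges, points per bucket).
# A value equal to an edge falls into the higher bucket (an assumption).
# Entries marked "text" come from the article; all others are made up.
SCORECARD = {
    # below 32 -> 10 (text); 32 and above -> 30 (hypothetical)
    "age": ([32], [10, 30]),
    # LTV of 75 -> 50 (text); the edges 70 and 90 and the points
    # 80 and 20 are hypothetical
    "ltv": ([70, 90], [80, 50, 20]),
    # IIR 20-50 -> 75 (text); IIR of 60 -> 5 (text);
    # below 20 -> 100 (hypothetical)
    "iir": ([20, 50], [100, 75, 5]),
}


def points(variable, value):
    """Return the score points for one variable's bucket."""
    edges, bucket_points = SCORECARD[variable]
    return bucket_points[bisect_right(edges, value)]


def credit_score(age, ltv, iir):
    """Add up the bucket points of the three variables."""
    return points("age", age) + points("ltv", ltv) + points("iir", iir)


# The 28-year-old borrower from the example: 10 + 50 + 5
print(credit_score(age=28, ltv=75, iir=60))
```

The scorecard is deliberately just a table plus a sum: once the points per bucket are fixed, scoring a borrower needs no statistics at all, which is what makes scorecards easy to deploy and explain.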
1. Decision tree
2. Neural Networks
3. Support Vector Machines
4. Probit Regression
5. Linear discriminant analysis
6. Logistic Regression
Logistic regression is the most commonly used technique for the purpose. We will explore more about logistic regression in the next article.
Sign-off Note
I must conclude this article by saying that good analysts find a good mathematical model as beautiful as the model walking on the catwalk ramp.
References
1. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi
2. Credit Scoring for Risk Managers: The Handbook for Lenders – Elizabeth Mays and Niall Lynas
The question of why did it happen? I don’t know if that can be answered by the data in isolation; rather, a subject matter expert should pose a series of hypotheses that we could perhaps prove true or false with the data.
Richard, I cannot agree more with you on this one. Human intelligence still holds the key to unlock the big question of ‘why’. Data is just facts or evidence; you still need Sherlock Holmes to unravel the mystery. The process requires asking the right questions, setting testable hypotheses, designing experiments, gathering data… and, in the case of Sherlock Holmes, seeing things that others have a blind spot for.
I must also point out that, after all this, science can only answer ‘why’ to a certain degree, and we should be wary that the notion of certitude can be demolished at any time, as happened with Newtonian mechanics. I think only a few areas of mathematics enjoy guaranteed certitude. For example, the Pythagorean theorem is and will be true for eternity. However, for most practical problems the degree of certainty science presents is good enough to take the field forward or even transform it for good.
I would have singled out the human brain as the only object in our known universe to chase the question of ‘why’ through logic and hypotheses. However, my wife pointed out last week that a ‘robot scientist’ named Adam does everything a graduate (PhD) student is capable of.
The following is taken directly from a Wikipedia article on Adam:
– hypothesizing to explain observations
– devising experiments to test these hypotheses
– physically running the experiments using laboratory robotics
– interpreting the results from the experiments
– repeating the cycle as required
I guess we humans need to be more creative in our approach to scientific knowledge, as robots are now competing with us!
Hi Roopam,
I started reading your posts a few months ago and I enjoy gaining knowledge from them. I started out from Part 1 but got busy with work in between and could not catch up. I can see posts 4-7 of the scorecard development series; where do I find the older posts?
Regards,
Afzal.
I found it, sorry, my browser was having problems. Thanks and keep up the good work! http://ucanalytics.com/blogs/author/roopam/
Thanks Afzal
Hi Roopam,
I like the way you explain things with examples. In spite of having a decent knowledge of these areas, I still love going through them :-)
Great work Roopam….
How did you make the table of “score points wise risk”? What are the criteria?
The scores are created using a linear transformation of the output from logistic regression. I suggest you read the following case study to get a better idea of this process. Link to the case study
Can you please write a post about predictive analytics?
Most of the cases on this site are about predictive analytics, so please go through the cases to get a good understanding of predictive analytics and data science.
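One widely used form of such a linear transformation scales the log-odds so that a chosen score corresponds to chosen odds, and a fixed number of points doubles the odds. The sketch below illustrates that convention only; it is not necessarily the exact method behind the table in the article. The function names and the parameter values (600 points at good:bad odds of 50:1, 20 points to double the odds) are hypothetical.

```python
import math


def scaling_params(target_score, target_odds, pdo):
    """Return (factor, offset) for score = offset + factor * ln(odds).

    target_score: score assigned to target_odds (hypothetical choice)
    target_odds:  good:bad odds at the target score
    pdo:          points needed to double the odds
    """
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    return factor, offset


def score_from_probability(p_good, factor, offset):
    """Convert a model's probability of 'good' into score points."""
    odds = p_good / (1.0 - p_good)  # good:bad odds
    return offset + factor * math.log(odds)


# Hypothetical scaling: 600 points at odds of 50:1, 20 points double the odds
factor, offset = scaling_params(target_score=600, target_odds=50, pdo=20)

# A borrower whose predicted odds of being good are 100:1
print(round(score_from_probability(100 / 101, factor, offset)))
```

Because the transformation is linear in the log-odds, the same factor and offset can be distributed across the logistic regression coefficients to give each bucket of each variable its own share of points.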
Hi Roopam,
I am a big fan of your website and blogs. I was going through this case study on credit risk modeling and found it really interesting, but I have a small doubt regarding the credit score.
Credit Score = Age + Loan to Value Ratio (LTV) + Installment (EMI) to Income Ratio (IIR)
As per the above equation, a 28-year-old man with the LTV of 75 and the IIR of 60 will have the score of 10+50+75 (as he falls under the 20-50 age group) = 135 and not 10+50+5 = 65, and hence is a high credit risk.
Please correct me if I am wrong. Thanks :)
Thanks, Shashikant. I am glad you enjoy YOU CANalytics.
You are reading the wrong column for age. This person falls in the ‘below 32 years’ bucket for age and hence gets 10 points for that. 20-50 is the IIR bucket, not the age bucket.
Hi Roopam,
I enjoyed reading your blogs. Though I have a decent knowledge of these areas, the way they are explained makes the reading much more interesting. I have one suggestion: I would appreciate it if the code were also explained side by side when explaining the algorithms.